# Quick Analysis Tool

<a href = "http://myy.haaga-helia.fi/~taaak/">Aki Taanila</a> has written a <a href="https://nbviewer.org/github/taanila/tilastoapu/blob/master/pika.ipynb">Quick Analysis Tool with Python</a> (Finnish notebook). The tool produces an Excel file containing frequency tables, cross-tabulations and essential statistical key figures for analyzing your research data. This section briefly explains the Quick Analysis Tool.

- You can use your own data in the code below. For the basics of Python and data analytics, see e.g. <a href="http://myy.haaga-helia.fi/~nurju/Teaching/#data-analytics-basics">Basics of Data Analytics with Python</a>.
- Define the types of variables as Python lists. The types are unnecessary variables, categorical variables, opinion scale variables and multiple choice variables. Use the empty list [] (mere brackets) if your data has no variables of a given type.
- After running the Python code, you will find the resulting Excel file in the same directory (folder) where you saved the ipynb file of the Quick Analysis Tool. By default this is the directory C:\Users\userid.
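To check where the Excel file will end up, the working directory of the notebook can be printed. This is a minimal sketch using only the standard library; it assumes the notebook runs in the folder where the ipynb file was saved, and it uses the file name `quick_analysis.xlsx` from the code further down.

```python
from pathlib import Path

# Relative file names are resolved against the current working directory,
# which for a Jupyter notebook is normally the folder containing the ipynb file.
output_path = Path.cwd() / 'quick_analysis.xlsx'
print(output_path)
```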

__Note__: The following code uses the Python module __xlsxwriter__, which may not be installed in Anaconda by default. It can be installed using Anaconda Navigator or, alternatively, from the Anaconda command prompt with the command

_conda install -c anaconda xlsxwriter_
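Whether the module is already available can be checked before installing anything. The following is a small sketch with the standard library only; the variable name `xlsxwriter_installed` is chosen here for illustration.

```python
import importlib.util

# find_spec returns None when the module cannot be found in the current environment
xlsxwriter_installed = importlib.util.find_spec('xlsxwriter') is not None
print('xlsxwriter installed:', xlsxwriter_installed)
```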

<a href = "http://myy.haaga-helia.fi/~taaak/">Aki Taanila</a>'s <a href="https://nbviewer.org/github/taanila/tilastoapu/blob/master/pika.ipynb">Quick Analysis Tool with Python</a>:

In [8]:
import pandas as pd

# To use your own data, replace the URL inside the quotation marks

df = pd.read_excel('http://myy.haaga-helia.fi/~menetelmat/Data-analytiikka/Teaching/data1_en.xlsx')

df

Unnamed: 0,number,sex,age,family,education,empl_years,salary,management,colleagues,environment,salary_level,duties,occu_health,timeshare,gym,massage
0,1,1,38,1,1.0,22.0,3587,3,3.0,3,3,3,,,,
1,2,1,29,2,2.0,10.0,2963,1,5.0,2,1,3,,,,
2,3,1,30,1,1.0,7.0,1989,3,4.0,1,1,3,1.0,,,
3,4,1,36,2,1.0,14.0,2144,3,3.0,3,3,3,1.0,,,
4,5,1,24,1,2.0,4.0,2183,2,3.0,2,1,2,1.0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77,78,1,22,1,3.0,0.0,1598,4,4.0,4,3,4,,1.0,1.0,
78,79,1,33,1,1.0,2.0,1638,1,3.0,2,1,2,1.0,,,
79,80,1,27,1,2.0,7.0,2612,3,4.0,3,3,3,1.0,,1.0,
80,81,1,35,2,2.0,16.0,2808,3,4.0,3,3,3,,,,
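
In the preview, the last four columns hold 1.0 where an option was selected and are empty otherwise. The sketch below uses made-up values, not the real data, to show how pandas appears to treat such columns: empty cells are read as NaN, and `sum()` skips NaN, so a column sum equals the number of selections.

```python
import numpy as np
import pandas as pd

# Made-up toy data, coded like the last four columns of the preview:
# 1.0 = option selected, NaN = not selected
toy = pd.DataFrame({
    'gym':     [1.0, np.nan, 1.0, np.nan],
    'massage': [np.nan, np.nan, 1.0, np.nan],
})

counts = toy.sum()            # NaN is skipped, so this counts the selections
shares = counts / len(toy)    # share of all respondents
print(counts)
print(shares)
```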


In [9]:
### Define the lists; if necessary, leave them empty (mere brackets [])

unnecessary = ['number']

categorical = ['sex', 'family', 'education']

opinion_scaled = ['management', 'colleagues', 'environment', 'salary_level', 'duties']

multiple_choice = ['occu_health', 'timeshare', 'gym', 'massage']
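
A misspelled name in any of the lists makes the later steps fail, so it can help to compare the lists with the column names first. The helper `missing_columns` below is hypothetical and not part of the original tool; the toy DataFrame is made up for illustration.

```python
import pandas as pd

# Hypothetical helper, not part of the original tool
def missing_columns(df, *lists):
    """Return the names in the given lists that are not columns of df."""
    return [name for names in lists for name in names if name not in df.columns]

# Made-up toy frame with a subset of the column names used above
toy = pd.DataFrame({'number': [1, 2], 'sex': [1, 2], 'age': [38, 29]})

print(missing_columns(toy, ['number'], ['sex', 'family'], []))
```

With the real data the call would be `missing_columns(df, unnecessary, categorical, opinion_scaled, multiple_choice)`; an empty result means every name matches a column.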

In [10]:
### Preliminary actions

df = df.drop(unnecessary, axis=1) ### Removing unnecessary variables

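categorical2 = categorical + opinion_scaled ### Extended list: categorical and opinion scale variables

### Quantitative variables are those left after dropping the categorical and multiple choice ones.
### Dropping an empty list changes nothing, so df_quantitative is defined even if both lists are empty.
df_quantitative = df.drop(categorical + multiple_choice, axis=1)
quantitative = df_quantitative.columns ### Names of the quantitative variables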

### Prepare ExcelWriter to write to a new file
writer = pd.ExcelWriter('quick_analysis.xlsx', engine='xlsxwriter')

### Percentage format, used in frequency tables and cross-tabulations

format = writer.book.add_format({'num_format':'0.0 %'})

### Frequency tables

if categorical2:
    
    ### We keep a running tally of Excel's row number in the variable called row
    
    row = 0
    
    ### We go through all the categorical variables using the for loop
    for var in categorical2:
        # Calculate frequencies into dataframe named df1
        df1 = pd.crosstab(df[var], 'f')
        # Calculate percentages into df1 (share of each value out of all observations)
        df1['f (%)'] = df1['f']/df1['f'].sum()
        # Add Total row into df1
        df1.loc['Total'] = df1.sum()
        # Write df1 into the Frequencies sheet of the Excel file 
        df1.to_excel(writer, sheet_name = 'Frequencies', startrow = row)
        # Increase the row number; shape[0] gives the number of rows of the dataframe df1
        row = row + df1.shape[0] + 2
    # Add percentage format into the column C
    writer.sheets['Frequencies'].set_column('C:C', cell_format=format)

### Cross-tabulations
    
    row = 3
    instruction = 'All cross-tabulations for categorical and opinion scale variables. Percentages are calculated from column total (n).'

    for var1 in categorical2:
        for var2 in categorical2:
            if var1 != var2:
                df1 = pd.crosstab(df[var1], df[var2])
                df2 = pd.crosstab(df[var1], df[var2], normalize='columns')
                df2.index.name=var1+'/'+var2 
                df2.loc['n'] = df1.sum()
                df2.to_excel(writer, sheet_name='Cross-tabulations', startrow=row)
                for i in range(row+1, row+df2.shape[0]):
                    writer.sheets['Cross-tabulations'].set_row(i, cell_format=format)
                row = row + df2.shape[0] + 2
    worksheet = writer.sheets['Cross-tabulations']
    worksheet.write(0, 0, instruction)
            

### Statistical key figures
            
### Statistical key figures for quantitative and opinion scaled variables to be written in the sheet named Statistical key figures

df1 = df_quantitative.describe()
df1.to_excel(writer, sheet_name = 'Statistical key figures')


### Statistical key figures in groups defined by categorical variables
if categorical2:
    row = df1.shape[0]+2
    for var1 in categorical2:
        for var2 in quantitative:
            if var1 != var2:
                df1 = df.groupby(var1)[var2].describe()
                df1.index.name = var1+'/'+var2
                df1.to_excel(writer, sheet_name = 'Statistical key figures', startrow = row)
                row = row + df1.shape[0]+2


### Correlations
                
### Correlation coefficients to be written in the sheet named Correlations
df1 = df_quantitative.corr()
df1.to_excel(writer, sheet_name = 'Correlations')


### Correlation coefficients in groups defined by categorical variables
if categorical2:
    row = df1.shape[0]+2
    for var1 in categorical2:
        df1 = df.groupby(var1)[quantitative].corr()
        df1.to_excel(writer, sheet_name = 'Correlations', startrow = row)
        row = row + df1.shape[0]+2

        
### Multiple choices

instruction = 'Percentages are calculated from the n-value (the size of the group).'
### Multiple choices to the sheet named Multiple choices

if multiple_choice:
    row = 3
    column = len(multiple_choice)+2
    ### Counts (f) and shares of all respondents, computed once for all multiple choice variables
    df1 = df[multiple_choice].sum().to_frame()
    df1 = df1.rename(columns = {0:'f'})
    df2 = pd.DataFrame()
    df2['% of responses'] = df1['f']/df.shape[0]
    df2.loc['n','% of responses'] = df.shape[0]
    df1.T.to_excel(writer, sheet_name = 'Multiple choices', startrow=row)
    df2.T.to_excel(writer, sheet_name = 'Multiple choices', startrow=row, startcol=column)
    writer.sheets['Multiple choices'].set_column(column+1,column+len(multiple_choice), cell_format=format)
    
### Multiple choices in groups defined by categorical variables
    row = row + df1.T.shape[0]+2
    for var1 in categorical2:
        df1 = df.groupby(var1)[multiple_choice].sum()
        df1.to_excel(writer, sheet_name = 'Multiple choices', startrow=row)
        for value in df[var1].dropna().unique():
            value_total=df[var1].value_counts()[value]
            df1.loc[value]=df1.loc[value]/value_total
            df1.loc[value, 'n']=value_total
        # Write the shares once, after every group has been converted
        df1.to_excel(writer, sheet_name = 'Multiple choices', startrow=row, startcol=column)
        row = row + df1.shape[0]+2
    worksheet = writer.sheets['Multiple choices']
    worksheet.write(0, 0, instruction)


# Saving the Excel file (writer.save() was removed in pandas 2.0; close() writes the file)

writer.close()
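
To see what the Frequencies and Cross-tabulations sheets contain without opening Excel, the same pandas calls can be tried on a tiny DataFrame. The sketch below uses made-up values, not the real survey data, and only prints the tables instead of writing them to a file.

```python
import pandas as pd

# Made-up toy data, not the real survey
toy = pd.DataFrame({
    'sex':       [1, 1, 2, 2, 1],
    'education': [1, 2, 2, 2, 1],
})

# Frequency table: counts and shares of each value, plus a Total row
freq = pd.crosstab(toy['sex'], 'f')
freq['f (%)'] = freq['f'] / freq['f'].sum()
freq.loc['Total'] = freq.sum()
print(freq)

# Cross-tabulation with column percentages and the column totals (n)
counts = pd.crosstab(toy['sex'], toy['education'])
shares = pd.crosstab(toy['sex'], toy['education'], normalize='columns')
shares.loc['n'] = counts.sum()
print(shares)
```

With `normalize='columns'` each column of the cross-tabulation sums to 1, which is why the n row is added separately from the plain counts.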

Source: Aki Taanila, https://nbviewer.org/github/taanila/tilastoapu/blob/master/pika.ipynb