# Introduction to Summary Statistics and Hypothesis Testing in Python

In this section, we will take you through some of the main concepts in statistics. This tutorial should only be taken after the student has completed the theoretical part of the course, as here we will focus on applying those concepts using Python.

Before analysing any data, it's important to distinguish between the different types of variables that you might come across in your projects.

![datatypes.png]  

source: https://www.graphpad.com/

Generally speaking, we can split variables into two types: numerical or categorical. It is important to consider this distinction before we analyse our data, because many statistical tests assume a particular type of data. If your data does not meet those assumptions, the results of the test will not be reliable.

Thus, one of the first steps in your data analysis should be to look at your variable types and think about which analysis would be most appropriate for them. Thankfully, Python makes it very easy for us to do this. But first, let's load a dataset.

In [7]:
import pandas as pd
import os

In [12]:
path = os.getcwd() # this gives us the current working directory
path = os.path.join(path, 'Data', 'Day9_data.csv') # builds the file path in a way that works on any operating system
df = pd.read_csv(path) # loads our data into a dataframe



Now that the data is loaded, we can easily inspect our variables using `head()` and `dtypes`. The former is a method that shows the first 5 rows of the dataframe, so you can see what each variable measures. The latter shows the data type of each variable. Note that `dtypes` is an attribute rather than a method, so it is written without parentheses: a pandas dataframe already stores the type of each column, and you only need to access that information.

In [13]:
df.head()

Unnamed: 0,id,danceability,energy,loudness,speechiness,duration_ms,acousticness,instrumentalness,liveness,valence,genre,song_name
0,2Vc6NJ9PW9gD9q343XFRKx,0.831,0.814,-7.364,0.42,124539,0.0598,0.0134,0.0556,0.389,Dark Trap,Mercury: Retrograde
1,7pgJBLVz5VmnL7uGHmRj6p,0.719,0.493,-7.23,0.0794,224427,0.401,0.0,0.118,0.124,Dark Trap,Pathology
2,0vSWgAlfpye0WCGeNmuNhy,0.85,0.893,-4.783,0.0623,98821,0.0138,4e-06,0.372,0.0391,Dark Trap,Symbiote
3,0VSXnJqQkwuH2ei1nOQ1nu,0.476,0.781,-4.71,0.103,123661,0.0237,0.0,0.114,0.175,Dark Trap,ProductOfDrugs (Prod. The Virus and Antidote)
4,4jCeguq9rMTlbMmPHuO7S3,0.798,0.624,-7.668,0.293,123298,0.217,0.0,0.166,0.591,Dark Trap,Venom


In [15]:
df.dtypes

id                   object
danceability        float64
energy              float64
loudness            float64
speechiness         float64
duration_ms           int64
acousticness        float64
instrumentalness    float64
liveness            float64
valence             float64
genre                object
song_name            object
dtype: object
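If you want pandas to do the numerical/non-numerical split for you, `select_dtypes()` is one option. The sketch below uses a small made-up table (the `toy` dataframe and its values are illustrative assumptions, not the course data), so treat it as a pattern rather than the required approach.

```python
import pandas as pd

# Made-up miniature of the songs table (values are illustrative only)
toy = pd.DataFrame({
    "danceability": [0.83, 0.72, 0.85],
    "duration_ms": [124539, 224427, 98821],
    "genre": ["Dark Trap", "Dark Trap", "Pop"],
})

# Columns that pandas stores as numbers (int64, float64, ...)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Every column that is not stored as a number
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```

Keep in mind that the stored type is only a hint: a column of numeric codes can still be categorical in meaning, so it is worth checking each variable yourself.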

Sometimes, when we load our data, the variables get saved as the wrong data type. In those cases, we want a quick and easy way to tell Python to change their data type. This can be done via the `astype()` method, where you specify the data type you want for your variable instead.

In [20]:
df['genre'] = df['genre'].astype('category') # Here we changed the type of genre from object to categorical
df.dtypes

id                    object
danceability         float64
energy               float64
loudness             float64
speechiness          float64
duration_ms            int64
acousticness         float64
instrumentalness     float64
liveness             float64
valence              float64
genre               category
song_name             object
dtype: object

Now, if you print that column, you can see all the genres found in this dataset.

In [None]:
df.genre
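Printing the whole column works, but a categorical column also exposes its labels directly through the `.cat` accessor. A minimal sketch on a made-up series (the `genres` values here are assumptions for illustration):

```python
import pandas as pd

# Made-up genre labels, converted to the category type as in the cell above
genres = pd.Series(["Pop", "Rap", "Pop", "Emo"]).astype("category")

# The distinct labels pandas found (sorted alphabetically for strings)
categories = list(genres.cat.categories)

# How many distinct labels there are
n_categories = genres.nunique()

print(categories)
print(n_categories)
```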

## Summarising numerical data

The most common ways of summarising numerical data are the measures of central tendency: the mean, median and mode, available via `mean()`, `median()` and `mode()`. We might also be interested in measures of spread such as the variance or standard deviation, which can be computed via `var()` and `std()` respectively. Let's have a look at how we can use these to find out about the duration of a song.

In [25]:
print(df.duration_ms.mean())
print(df.duration_ms.median())
print(df.duration_ms.mode())
print(df.duration_ms.var())
print(df.duration_ms.std())



194521.72884427715
191587.0
0    192000
Name: duration_ms, dtype: int64
3645313674.36615
60376.43310403613


Note that dataframes have built-in methods such as `mean()` which allow us to compute summary statistics with ease. However, if you work with other data structures, that might not be the case. In those scenarios, you can use the `numpy` module that we showed you earlier. For the mean, the two give the same result, so it is up to you which one you prefer to use.

In [26]:
import numpy as np
np.mean(df.duration_ms)

194521.72884427715
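One caveat when switching between the two libraries: the pandas `var()` and `std()` methods use the sample formula by default (dividing by n - 1, i.e. `ddof=1`), whereas `np.var()` and `np.std()` use the population formula by default (dividing by n, i.e. `ddof=0`). The sketch below checks this on four made-up numbers; the variable names are purely illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])  # made-up data

# pandas default: sample variance, divides by n - 1
pandas_var = values.var()

# NumPy default on a plain array: population variance, divides by n
numpy_var = np.var(values.to_numpy())

# Passing ddof=1 makes NumPy match the pandas default
numpy_sample_var = np.var(values.to_numpy(), ddof=1)

print(pandas_var, numpy_var, numpy_sample_var)
```

On a dataset with thousands of rows the difference is tiny, but it becomes noticeable for small samples.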

The code above gives us some information about the duration of all songs. But what if we wanted the summary measures to represent the duration of the songs in minutes, rather than milliseconds? That would be far more interpretable.

In [None]:
# CODE HERE
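If you get stuck, here is one possible approach, shown on a made-up three-row table rather than the real data. It relies only on the fact that one minute is 60,000 milliseconds; the `duration_min` column name is an assumption chosen for this sketch.

```python
import pandas as pd

# Made-up durations in milliseconds (2, 3 and 4 minutes)
toy = pd.DataFrame({"duration_ms": [120000, 180000, 240000]})

MS_PER_MINUTE = 60_000  # 60 seconds x 1000 milliseconds

# Store the converted values in a new column so the original is kept
toy["duration_min"] = toy["duration_ms"] / MS_PER_MINUTE

mean_min = toy["duration_min"].mean()
median_min = toy["duration_min"].median()
std_min = toy["duration_min"].std()

print(mean_min, median_min, std_min)
```

Converting the column first and summarising afterwards keeps every statistic in the same unit, including the variance, which would otherwise need dividing by 60,000 squared.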






### Plotting numerical data

Before we do any analysis, it's really useful to look at the distribution of our data. Let's use what you learnt in the previous lesson to plot the distribution of song duration, and its danceability (note that higher values represent higher danceability) via a histogram. You should also plot some box plots and compare the two; check if one plot provides you with more information than the other.

In [None]:
## CODE HERE






## Summarising Categorical Data

With non-numerical data, widely used measures such as the mean and standard deviation do not work. Instead, we use measures of frequency and, when the categories can be ordered, measures of position and spread such as percentiles and the range. The median can likewise be used for categorical data only if the categories can be ordered; otherwise it is not meaningful.
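To make the ordered case concrete, here is a minimal sketch with a hypothetical ordered variable (`ratings` is not part of the songs dataset, which has no ordered categories). It finds the median by sorting the values and picking the middle one, which is a simple approach that assumes an odd number of observations.

```python
import pandas as pd

# Hypothetical ordered categories: low < medium < high
ratings = pd.Series(pd.Categorical(
    ["low", "high", "medium", "low", "medium"],
    categories=["low", "medium", "high"],
    ordered=True,
))

# Sorting respects the declared category order, not alphabetical order
ordered = ratings.sort_values()

# Middle value of the sorted data (odd number of observations assumed)
median_rating = ordered.iloc[len(ordered) // 2]

# The range is described by the lowest and highest observed category
lowest, highest = ratings.min(), ratings.max()

print(median_rating, lowest, highest)
```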

Frequencies can be calculated by counting the number of occurrences. One way to do that in Python is via the `value_counts()` method.

In [27]:
df['genre'].value_counts()

Underground Rap    5875
Dark Trap          4578
Hiphop             3022
RnB                2099
Trap Metal         1956
Rap                1848
Emo                1680
Pop                 461
Name: genre, dtype: int64

Which one is the genre with the lowest number of songs?

Sometimes the overall count can be hard to interpret or misleading, so we can change the representation of the data to a proportion.

In [29]:
df['genre'].value_counts(normalize = True)

Underground Rap    0.273015
Dark Trap          0.212742
Hiphop             0.140434
RnB                0.097542
Trap Metal         0.090896
Rap                0.085878
Emo                0.078071
Pop                0.021423
Name: genre, dtype: float64
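It can be handy to see counts and proportions side by side, and to let pandas pick out the rarest category. A small sketch on made-up labels (the `genres` values and the `summary` table are assumptions for illustration only):

```python
import pandas as pd

# Made-up genre labels
genres = pd.Series(["Rap", "Pop", "Rap", "Emo", "Rap", "Emo"])

counts = genres.value_counts()                 # absolute frequencies
shares = genres.value_counts(normalize=True)   # relative frequencies

# Combine both views into one table, aligned on the genre label
summary = pd.DataFrame({"count": counts, "proportion": shares.round(3)})

# Label of the category with the fewest rows
rarest = counts.idxmin()

print(summary)
print(rarest)
```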

Sometimes, when we want to run specific statistics, we need our categorical data to be binarised into columns containing zero or one, where a one only occurs when a specific category is found. This can be done via the `get_dummies()` function in pandas, by simply specifying which columns you want to binarise.

In [31]:
df = pd.get_dummies(data = df, columns= ['genre'])
df

Now, we can look at the frequency of each genre in a similar way.

In [34]:
df['genre_Pop'].value_counts() # counts; not displayed, because a cell only shows its last expression
df['genre_Pop'].value_counts(normalize=True) # proportions

0    0.978577
1    0.021423
Name: genre_Pop, dtype: float64
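A useful property of a 0/1 column is that its mean equals the proportion of ones. The sketch below shows this on a made-up table. It passes `dtype=int` explicitly because recent pandas versions return boolean dummy columns by default, whereas the output above shows zeros and ones.

```python
import pandas as pd

# Made-up table with a single categorical column
toy = pd.DataFrame({"genre": ["Pop", "Rap", "Rap", "Emo"]})

# dtype=int asks for 0/1 columns instead of True/False
dummies = pd.get_dummies(toy, columns=["genre"], dtype=int)

# The mean of a 0/1 column is the share of rows in that category
pop_share = dummies["genre_Pop"].mean()

print(list(dummies.columns))
print(pop_share)
```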

### Plotting categorical data

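Let's also visualise our categorical data. Use what you learnt in the previous lesson to plot the distribution of the music genre via a bar chart and a pie chart. Note that `get_dummies()` replaced the `genre` column with the dummy columns above, so you will first need to reload the data or rerun the earlier cells. Which chart do you think is more useful for visualising this data, and why?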

In [None]:
## CODE HERE




