# Hypothesis testing and statistical analysis - Level 3
So far we have explored how to load, clean and summarise data in Python. In this module, we are going to learn how to test hypotheses using several methods.

This is level 3. Complete the code.

**Brief**

Did Apple Store apps receive better reviews than Google Play apps?

## The challenges of the workshop are:

1. Clean the two data sets
    * Load the two datasets
    * Pick the columns that we are going to work with 
    * Check the data types and fix them
    * Create a column called platform whose values are “apple” or “google”
2. Join the two data sets
    * To do that use the function `pd.concat` with the parameter `ignore_index` (the older `DataFrame.append` method did the same job but was removed in pandas 2.0)
    * Eliminate the `NaN` values
    * Only use the apps which contain reviews
3. Summarise the data visually and analytically (by the column `platform`)
    * Use the function .describe()
    * Use a boxplot
4. Test the following hypothesis: “The differences in the average ratings of Apple and Google users are due to chance and not due to differences in the platforms”
    * Let’s use traditional methods: parametric and non-parametric
    * Let’s use permutations

As you are going to see, the first steps of every single data analysis are loading and cleaning the data. Today is not an exception, so that is going to be our first step.

## Importing the libraries

In this case we are going to import pandas, numpy, and matplotlib.pyplot. From scipy, import stats.
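
As a minimal sketch, the import cell could look like the following. The aliases `pd`, `np` and `plt` are the usual conventions, and they match the names used in the comments later in this notebook (`pd.to_numeric`, `np.random.permutation`, `plt.hist`, `stats.ttest_ind`).

```python
import pandas as pd              # data frames and CSV loading
import numpy as np               # arrays and random permutations
import matplotlib.pyplot as plt  # histograms and other plots
from scipy import stats          # statistical tests
```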

## Challenge 1 -  Loading and cleaning data

### Load data
Load the data from the folder Stats on your Desktop. This data comes from Kaggle, a large repository of datasets that also hosts data competitions. The data from the Apple Store can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) and the data from the Google Play Store can be found [here](https://www.kaggle.com/lava18/google-play-store-apps).

In [None]:
# Once the files are saved, we need to load them into Python using read_csv and pandas
# Let's create a variable called google where we are going to store the path of the file

# Let's read the csv file into a data frame called Google

# Let's observe the first three entries


In [None]:
# Once the files are saved, we need to load them into Python using read_csv and pandas
# Let's create a variable called apple where we are going to store the path of the file 

# Let's read the csv file into a data frame called Apple

# Let's observe the first three entries


Based on the documentation of both datasets, the most suitable columns to answer the brief are:

1. For Google:
    * `Category` # Do we need this?
    * `Rating`
    * `Reviews`
    * `Price` (maybe)
2. For Apple:    
    * `prime_genre` # Do we need this?
    * `user_rating` 
    * `rating_count_tot`
    * `price` (maybe)

### Subsetting
Let's select the columns that we want for both datasets.

Overwrite the subsets in the original variables.

In [None]:
# Let's subset the dataframe Google by only selecting the variables ['Category', 'Rating', 'Reviews', 'Price']

# Let's check the first three entries


In [None]:
# Let's subset the dataframe Apple by only selecting the variables ['prime_genre', 'user_rating', 'rating_count_tot', 'price']

# Let's check the first three entries


### Checking data types for both Apple and Google
In this part let's figure out whether the variables that we selected contain errors/mistakes in the expected datatype.

In [None]:
# Use the attribute dtypes (no parentheses needed)
## Check out the data types of the dataframe Apple. Are the data types expected?


As you can see all the data types of `Apple` are expected. What about the data types of `Google`?

In [None]:
# Check out the data types of the dataframe Google. Are the data types expected?


Check out the unique values of the column `Price` in the dataset `Google`.

In [None]:
# Use the function .unique to the column Price in the dataset Google


Interesting... There is a price called `Everyone`. That is a massive mistake. Let's check the datapoints that have the price value as `Everyone`.

In [None]:
# Let's check what is the data point which contains the price 'Everyone' using a subset of Google.
## Subset by the column Price that equals to `Everyone`.


Now it is time to eliminate that observation. 

In [None]:
# Let's eliminate that point because it has the wrong information.
## To do that, subset Google but instead of using 'is equal to' use 'is different from'.

### Check again the unique values of Google


Now the problem is that the prices contain the symbol `$`. Therefore for Python these values are still considered `str` elements and not numbers! So let's eliminate the dollar symbol and convert the column into a numeric data type.
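
The idea can be sketched on a small made-up series (the values below are hypothetical, not taken from the Google data). Passing `regex=False` is a precaution: `$` is a special character in regular expressions, and the default of the `regex` parameter has changed between pandas versions.

```python
import pandas as pd

# Hypothetical toy prices, stored as strings like in the raw file
prices = pd.Series(["0", "$4.99", "$1.49"])

# Remove the literal '$' character; regex=False avoids treating '$' as a regex anchor
no_symbol = prices.str.replace("$", "", regex=False)

# Convert the cleaned strings into numbers
numeric_prices = pd.to_numeric(no_symbol)
print(numeric_prices.dtype)
```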

In [None]:
# Let's create a variable called nosymb.
## That variable will take the Price column of Google and apply the str.replace function.
### In the parameters specify to find '$' and replace with nothing ''; also pass regex=False so that '$' is treated as a literal character.

#### Apply pd.to_numeric() to the variable nosymb, save it as the Price column in Google


Let's check the data types of Google.

In [None]:
# Use the attribute dtypes


The column `Reviews` is still an object column, but we need it to be numeric.

In [None]:
# Use the function pd.to_numeric, save the result in the same column


In [None]:
# Let's check the data types of Google again


### New columns for `Apple` and `Google` called `platform`
Let's create a new column called `platform` where the value for the Apple dataframe is 'apple', and for Google is 'google'

In [None]:
# Create a column called 'platform' and add either 'apple' or 'google' respectively


### Changing the column names to unify the two datasets
Now we need to rename the variables of `Apple` to be the same as `Google` or vice versa.
In this case, we are going to change the `Apple` column names to the names of `Google` columns.

This is an important step to unify the two datasets.

In [None]:
# Changing the names of the Apple columns to the Google column names
Apple.columns = Google._ _ _


## Challenge 2 -  Combine the two datasets

Combine the two datasets into a single data frame called `df`
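
A minimal sketch of stacking two data frames with the same columns, using hypothetical toy frames rather than the real data. It uses `pd.concat`, because `DataFrame.append` was removed in pandas 2.0; `ignore_index=True` rebuilds the row index from 0.

```python
import pandas as pd

# Hypothetical toy frames with identical column names
google_toy = pd.DataFrame({"Rating": [4.1, 3.9], "platform": ["google", "google"]})
apple_toy = pd.DataFrame({"Rating": [4.5, 3.0], "platform": ["apple", "apple"]})

# Stack the rows of both frames and renumber the index 0..n-1
combined = pd.concat([google_toy, apple_toy], ignore_index=True)
print(combined)
```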

In [None]:
# Let's use the function pd.concat (DataFrame.append was removed in pandas 2.0)
## pass a list with Google first and Apple second, and set ignore_index to True

# Check 12 random points of your dataset


As you can see there are some `NaN` values, eliminate all the `NaN` values from the table
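
As a small illustration on made-up data, `.dropna()` removes every row that contains at least one `NaN`, and comparing `.shape` before and after shows how many rows were lost.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values in two different rows
toy = pd.DataFrame({"Rating": [4.5, np.nan, 3.0], "Reviews": [10, 5, np.nan]})
print(toy.shape)  # (3, 2)

# Keep only the rows without any NaN, overwriting the same variable
toy = toy.dropna()
print(toy.shape)  # (1, 2)
```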

In [None]:
# Let's first check the dimensions of df before dropping `NaN` values

# Use the function .dropna to eliminate all the NaN values, overwrite in the same dataframe

# Check the dimensions of df


Now let's check how many apps have 0 reviews.

In [None]:
# Subset df by the column Reviews which is 0
## Use the function .count()


929 apps do not have any reviews, so we need to eliminate these points!

In [None]:
# Eliminate the points that have 0 reviews by subsetting df using the expression "different from" !=


## Challenge 3 -  Summarise the data visually and analytically (by the column `platform`)

### Analytical summary
We need a summary of the column `Rating` but separated by the different platforms
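
The pattern of grouping first and describing second can be sketched on a hypothetical toy frame (the ratings below are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data: two ratings per platform
toy = pd.DataFrame({
    "platform": ["apple", "apple", "google", "google"],
    "Rating": [4.0, 5.0, 3.0, 4.0],
})

# One row of summary statistics (count, mean, std, quartiles...) per platform
summary = toy.groupby("platform")["Rating"].describe()
print(summary)
```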

In [None]:
# To summarise analytically, let's use the function `.describe` to the column `Rating`
## after grouping by the variable `platform` 


### Visual summary
We need a summary of the column `Rating` but separated by the different platforms, let's use a boxplot!

In [None]:
# We can use the function boxplot on df
## set the parameters by = 'platform' and column = ['Rating']


## Challenge 4 -  Test whether there are significant differences between Apple and Google reviews

The first step to test a hypothesis is to define the hypothesis that you want to test. For that reason your task here is to provide a null and an alternative hypothesis to complete the brief.

H<sub>null</sub>:


H<sub>alternative</sub>:

The second step is to determine the significance level.

SL: 

Now that the hypotheses and the significance level are defined, we are going to see whether we can reject the null hypothesis or fail to reject it. [Student's t-test](https://en.wikipedia.org/wiki/Student%27s_t-test#Independent_two-sample_t-test) evaluates the difference between the means of two independent groups. Run a Student's t-test to compare Apple and Google reviews.

In [None]:
# Let's create the subset of the column 'Rating' by the different platforms. 
## Call the subsets 'apple' and 'google' 




In [None]:
# Let's use the function stats.ttest_ind
## in the parameters use the 'Rating' column of a subset for apple and google platform
### save the result in a variable called t_test (the next cell uses that name)

# Check the test


What can you conclude? The following code may help:

In [None]:
if t_test.pvalue <= 0.05:
    print('The p-value was:', t_test.pvalue,
          'A difference this large would be very unlikely under chance alone, so we reject the null hypothesis')
else:
    print('The p-value was:', t_test.pvalue,
          'A difference this large is plausible under chance alone, so we fail to reject the null hypothesis')

Student's t-test has very strong assumptions that we probably overlooked. Could you write down the assumptions of the Student's t-test? You can find them [here](https://en.wikipedia.org/wiki/Student%27s_t-test#Independent_two-sample_t-test) 

As you can see, Student's t-test assumes that the variances of the two groups are equal. This is not the case here (check the analytical summary), but we can use Welch’s t-test instead, which drops that assumption and is available through the same function that we used before.
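
A minimal sketch of the difference between the two tests, run on synthetic samples with deliberately unequal variances and sizes (the numbers are invented, not the app ratings). The only change between the calls is the `equal_var` parameter of `stats.ttest_ind`.

```python
import numpy as np
from scipy import stats

# Synthetic samples with different spreads and sizes (illustrative only)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=4.0, scale=0.5, size=200)
group_b = rng.normal(loc=4.2, scale=1.0, size=300)

# Student's t-test: assumes both groups share the same variance
student = stats.ttest_ind(group_a, group_b)

# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print("Student:", student.statistic, student.pvalue)
print("Welch:  ", welch.statistic, welch.pvalue)
```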

In [None]:
# Let's use the function stats.ttest_ind
## in the parameters use the 'Rating' column of a subset for apple and google platform
### the parameter equal_var set to False


Both tests (Student's t-test and Welch’s t-test) assume that the data is normally distributed. We can test that using the function `stats.normaltest`.

In [None]:
# Test the normal distribution of the apple reviews using the function stats.normaltest
## save the result in a variable called apple_normal


In [None]:
# Test the normal distribution of the google reviews using the function stats.normaltest
## save the result in a variable called google_normal


The null hypothesis of the normality test is that the data is normally distributed. Given that the p-values are effectively zero, we reject that hypothesis and conclude that the data is not normally distributed. You can also assess this with a histogram of the data.
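
To get a feel for how the p-value of `stats.normaltest` behaves, here is a sketch on two synthetic samples: one drawn from a normal distribution and one from a strongly skewed (exponential) distribution. Both samples are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_sample = rng.normal(size=500)       # drawn from a normal distribution
skewed_sample = rng.exponential(size=500)  # clearly not normal

normal_result = stats.normaltest(normal_sample)
skewed_result = stats.normaltest(skewed_sample)

# A very small p-value is evidence against normality
print("normal sample p-value:", normal_result.pvalue)
print("skewed sample p-value:", skewed_result.pvalue)
```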

In [None]:
# Create a histogram of the apple reviews distribution


In [None]:
# Create a histogram of the google reviews distribution


There are non-parametric tests whose assumptions are more relaxed. The non-parametric counterpart of Student's t-test is the Mann-Whitney U test.
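
A minimal sketch of the call on synthetic integer ratings (invented data). The `alternative="two-sided"` argument is spelled out here as a precaution, since the default of that parameter has differed between SciPy versions.

```python
import numpy as np
from scipy import stats

# Synthetic star ratings for two hypothetical groups
rng = np.random.default_rng(2)
ratings_a = rng.integers(1, 6, size=100)  # integers from 1 to 5
ratings_b = rng.integers(2, 6, size=100)  # integers from 2 to 5

# Two-sided test: are the two distributions shifted relative to each other?
result = stats.mannwhitneyu(ratings_a, ratings_b, alternative="two-sided")
print(result.statistic, result.pvalue)
```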

In [None]:
# Use the function stats.mannwhitneyu to test the differences between Apple and Google



### Permutations 
Permutation tests and bootstrapping are flexible techniques for testing differences between groups. Their strength is that they make very few assumptions about the distribution of the data (a permutation test only requires that the observations be exchangeable under the null hypothesis), so they are easy to apply to many problems.  
Check out more about permutations [here](http://rasbt.github.io/mlxtend/user_guide/evaluate/permutation_test/)
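
The whole procedure can be sketched on synthetic data before applying it to `df`. Everything below is an illustrative assumption: the two samples are invented, the number of permutations (5000) is arbitrary, and the sketch uses NumPy's `Generator.permutation` where the cells below use `np.random.permutation`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic ratings for two hypothetical groups
apple_toy = rng.normal(4.0, 0.7, size=150)
google_toy = rng.normal(4.2, 0.5, size=200)

# Observed absolute difference between the group means
observed = abs(apple_toy.mean() - google_toy.mean())

# Under the null hypothesis the group labels are interchangeable,
# so we pool the ratings and reshuffle them many times
pooled = np.concatenate([apple_toy, google_toy])
n_apple = len(apple_toy)
n_permutations = 5000

differences = np.empty(n_permutations)
for i in range(n_permutations):
    shuffled = rng.permutation(pooled)
    differences[i] = shuffled[:n_apple].mean() - shuffled[n_apple:].mean()

# Two-sided p-value: share of shuffles at least as extreme as the observed difference
p_value = np.mean(np.abs(differences) >= observed)
print("observed difference:", observed)
print("p-value:", p_value)
```

The p-value is simply the proportion of reshuffled differences whose absolute value is at least as large as the observed one, which is the same count that the last cells of this section ask for.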

In [None]:
# Let's create a variable `Permutation1` with the function np.random.permutation,
## then look at an analytical summary of it after grouping by `platform`
Permutation1 = np.random.permutation(_ _ _)
# Let's create a variable called Permutation in df with the Permutation1 values
df["Permutation"]=Permutation1
df.groupby(by=_ _ _)[_ _ _].describe()

In [None]:
# Let's compare with the previous analytical summary


In [None]:
# Let's create a vector with the differences - that will be the distribution of the null hypothesis 
## create an empty list called difference
### in a for loop create a variable called permutation using the function np.random.permutation
#### append to `difference` the mean of the permutation subset that corresponds to 'apple' minus
##### the mean of the permutation subset that corresponds to 'google'


In [None]:
# Let's see the distribution of the null hypothesis
## Call the plot 'histo'
### Use the function 'plt.hist'


In [None]:
# Let's see what was the observed difference between the averages

## Transform that number into the absolute value



In [None]:
# How many simulated permutations have a value equal to or more extreme than the observed difference?


## What can you conclude?

Type your conclusion: 

## Extra challenge
Imagine that the observed difference was 0.022, calculate the p-value for this case.