Pandas is the most popular data analysis library in Python. It is a powerful tool for data manipulation and analysis.

Most of the time, our data is stored as a file on our hard drive. We can read the data into a Pandas DataFrame using pandas functions.

In this example, we will explore the world happiness report data from Kaggle. Learn more about this dataset [here](https://www.kaggle.com/datasets/ajaypalsinghlo/world-happiness-report-2021?resource=download).

We are basically doing a less impressive version of [this notebook](https://www.kaggle.com/code/joshuaswords/awesome-eda-2021-happiness-population).

First, let's import pandas. Make sure to install it if you get a `ModuleNotFoundError`.

In [None]:
... # importing pandas as 'pd' is standard practice

Now, we can read the data into a DataFrame. We will use the `pd.read_csv` function. It accepts many arguments, but only two matter here: the path to the file and the delimiter.

The file we want to read is called `world_happiness_report.csv` and it's in the `data` folder. Therefore, the path that we want to pass in is `data/world_happiness_report.csv`. Notice that this path is relative to the location of the notebook.

The delimiter has a default value of '`,`', which is what we want, so we don't need to pass it in.
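
As a minimal sketch of how the delimiter argument behaves, the snippet below reads the same made-up table twice from in-memory text rather than from a file: once with the default comma and once with a semicolon passed through `sep` (which `delimiter` is an alias for). The values and the `io.StringIO` stand-in are assumptions for illustration, not part of the happiness dataset.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk (values are illustrative only)
comma_text = (
    "Country name,year,Healthy life expectancy at birth\n"
    "Botswana,2006,50.0\n"
    "Botswana,2007,51.0\n"
)
# The same table, but separated by semicolons instead of commas
semicolon_text = comma_text.replace(",", ";")

# The default delimiter is a comma, so nothing extra is needed here
comma_df = pd.read_csv(io.StringIO(comma_text))

# For any other delimiter, pass it explicitly with sep
semicolon_df = pd.read_csv(io.StringIO(semicolon_text), sep=";")

# Both calls should produce identical DataFrames
print(comma_df.equals(semicolon_df))
```

With a real file, the first argument would simply be the path string instead of the `io.StringIO` object.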

In [None]:
df = ...

DataFrames have a number of useful attributes and methods that we can use to explore the data.

In [None]:
# we can see what the columns are by using the .columns attribute
...

# we can see the shape of the data using the .shape attribute
...

In [None]:
# we can get a summary of the data using the describe() method
...

In [None]:
# The head method displays the first 5 rows of the dataframe.
...

In [None]:
# The tail method displays the last 5 rows of the dataframe.
...
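
To see what these attributes and methods return before trying them on the real data, here is a small sketch on a toy DataFrame. The frame `toy_df` and its values are made up for illustration; only the column names are borrowed from this notebook.

```python
import pandas as pd

# Made-up toy values, not taken from the real dataset
toy_df = pd.DataFrame({
    "Country name": ["A", "A", "B", "B"],
    "year": [2019, 2020, 2019, 2020],
    "Healthy life expectancy at birth": [60.0, 61.0, 70.0, 71.0],
})

print(toy_df.columns.tolist())  # column names
print(toy_df.shape)             # (number of rows, number of columns)
print(toy_df.head(2))           # first 2 rows (the default is 5)
print(toy_df.tail(2))           # last 2 rows (the default is 5)
print(toy_df.describe())        # summary statistics for the numeric columns
```

Note that `.columns` and `.shape` are attributes (no parentheses), while `head`, `tail`, and `describe` are methods that need to be called.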

Let's run a quick analysis to see how life expectancy in Botswana has changed since we started collecting data.

First, let's extract the rows for Botswana. We can do that by indexing the DataFrame with a condition on the `Country name` column. We want rows where the value of the `Country name` column is `Botswana`, so we can use the `==` operator. The comparison produces a boolean Series that is `True` for every row where `Country name` is `Botswana`, and indexing the DataFrame with that Series keeps only those rows.
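
The idea of a boolean mask can be sketched on a toy frame first. The data below is made up for illustration, and the names `mask` and `selected` are placeholders rather than anything the notebook requires.

```python
import pandas as pd

# Made-up toy data with two countries
toy_df = pd.DataFrame({
    "Country name": ["Botswana", "Chile", "Botswana"],
    "year": [2006, 2006, 2008],
})

# Comparing a column to a value gives one True/False per row
mask = toy_df["Country name"] == "Botswana"
print(mask.tolist())

# Indexing the DataFrame with the mask keeps only the True rows
selected = toy_df[mask]
print(selected)
```

The selected rows keep their original index labels, which is why the result here is labelled 0 and 2 rather than 0 and 1.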

In [None]:
botswana_df = ...

Now let's plot it using seaborn. Don't forget to import seaborn first.

We want to plot `Healthy life expectancy at birth` as our y-axis and `year` as our x-axis. We can do this using the `sns.lineplot` function. We pass in the dataframe, the x-axis column, and the y-axis column.

In [None]:
# first import seaborn as sns
...

# seaborn has a lineplot function
# this function takes a dataframe in its 'data' argument
# we can also pass x and y to it, using column names from the df

...

Note that pandas also has a `.plot` method that can be used to plot dataframes very quickly, but with less customization.

In [None]:
botswana_df.plot(x="year", y="Healthy life expectancy at birth")

We can use this method to plot average life expectancy over time globally as well.

First, however, we need to compute the average life expectancy for each year. We can do this by grouping the data by `year` and then using the `.mean` method on the `Healthy life expectancy at birth` column.
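
Here is a minimal sketch of the group-then-average pattern on a toy frame. The values and the names `grouped`, `toy_means`, and `toy_means_df` are made up for illustration.

```python
import pandas as pd

# Made-up toy values: two countries observed in two years
toy_df = pd.DataFrame({
    "Country name": ["A", "B", "A", "B"],
    "year": [2019, 2019, 2020, 2020],
    "Healthy life expectancy at birth": [60.0, 70.0, 62.0, 72.0],
})

# Group the rows by year, then average one column within each group
grouped = toy_df.groupby("year")
toy_means = grouped["Healthy life expectancy at birth"].mean()
print(toy_means)

# After grouping, 'year' is the index; reset_index turns it back into a column
toy_means_df = toy_means.reset_index()
print(toy_means_df)
```

Whether `year` ends up as the index or as a column matters for plotting, since `x="year"` only works when `year` is a column.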

In [None]:
# first use the .groupby method to group the data by 'year'
df_by_year = ...

# the .mean method returns the mean of each numeric column in each group
# (in pandas 2.0+, pass numeric_only=True if the frame also has text columns)
means = ...

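# plot it!
# note: after grouping by 'year', 'year' is the index unless it has been reset;
# if x="year" raises a KeyError, call means.reset_index() first
means.plot(x="year", y="Healthy life expectancy at birth")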
print(means)

Hopefully, you can see how easy it is to explore data using pandas!

Let's try a more complex analysis, exploring `Positive affect` and `Perceptions of corruption`. First, let's naively plot these two columns. We can do this with the `sns.lineplot` function: put the dataframe in the `data` argument, and pass in the `Perceptions of corruption` column as the x-axis and the `Positive affect` column as the y-axis.

In [None]:
...

Looks like there might be a trend, but a lineplot is probably not the best for this. Let's try a scatterplot with linear regression. Seaborn has a function called `regplot` that can do this.

In [None]:
...

The linear regression makes the trend much clearer, but is this correlation statistically significant?

Let's find out using pandas and a new package called `scipy`.

Before running any analysis, let's make sure that the data is cleaned. In this case, our data has values that are set to NaN (not a number).

In [None]:
# are there any NaN values?
num_nan = df.isna().sum()
print("\tNumber of NaN by Column")
print(num_nan)

Looks like we are missing a lot of data, especially in the `Perceptions of corruption` column. For now, let's remove the rows with missing values. We can do this by using the `dropna` method.

Depending on your analysis, you may want to fill in this data with another value. There is a `fillna` method to do this.
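
The difference between the two approaches can be sketched on a toy frame with a couple of missing values. The numbers are made up, and filling with the column mean is just one possible choice, used here as an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Made-up toy values with one NaN in each column
toy_df = pd.DataFrame({
    "Positive affect": [0.7, np.nan, 0.8, 0.6],
    "Perceptions of corruption": [0.9, 0.8, np.nan, 0.7],
})

# dropna removes every row that contains at least one NaN
dropped = toy_df.dropna()
print(dropped)

# fillna replaces NaN instead; here each NaN gets its column's mean
filled = toy_df.fillna(toy_df.mean())
print(filled)
```

Both methods return a new DataFrame and leave `toy_df` unchanged, so the result needs to be assigned to a variable.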

In [None]:
# use dropna to remove the rows with nan
cleaned_df = ...

In [None]:
corruption = cleaned_df['Perceptions of corruption']
positive_affect = cleaned_df['Positive affect']

# pandas has a .corr method which will give us the correlation between two columns,
# but not the p-value
r = positive_affect.corr(corruption, method='pearson')
print("pandas pearson correlation:", r)

That works fine, but we want the statistical significance as well. Let's use the `scipy` package to do this.
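
As a small sketch of what `scipy.stats.pearsonr` returns, the snippet below runs it on two made-up Series and compares the coefficient with the one from pandas. The data and variable names are illustrative assumptions.

```python
import pandas as pd
from scipy.stats import pearsonr

# Made-up values that are close to a straight line
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = pd.Series([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

# pearsonr returns the correlation coefficient and the p-value together
r, p = pearsonr(x, y)
print("scipy r:", r, "p-value:", p)

# pandas computes the same coefficient, but does not report a p-value
pandas_r = x.corr(y, method="pearson")
print("pandas r:", pandas_r)
```

Note that `pearsonr` does not handle NaN values for you, which is one reason to clean the data first.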

In [None]:
# we can use scipy to calculate the significance of the correlation
# import scipy.stats
...

# use scipy.stats.pearsonr to calculate the correlation
r, p = ...
print("scipy pearson correlation:", r, "p-value:", p)


alpha = 0.05
if p < alpha:
    print("The correlation is significant!")
else:
    print("The correlation is not significant!")


More info on correlation in Python can be found [here](https://realpython.com/numpy-scipy-pandas-correlation-python/).

Maybe we want to see all of the correlations between the variables. We can use the DataFrame's `corr` method to do this.

In [None]:
# correlations are only defined for numeric columns, so select those first
correlation_matrix = df.select_dtypes("number").corr()
print("Correlation Matrix")
correlation_matrix

What about the significance of these correlations? Let's use `scipy` again. Here's an example of a function that can do this.

In [None]:
# taken from: https://stackoverflow.com/questions/25571882/pandas-columns-correlation-with-statistical-significance
from scipy.stats import pearsonr
import pandas as pd

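def calculate_pvalues(df):
    # drop rows with missing values and keep only the numeric columns
    df = df.dropna().select_dtypes("number")
    # empty square table with the column names on both axes
    pvalues = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)
    for r in df.columns:
        for c in df.columns:
            # pearsonr returns (correlation, p-value); keep the p-value
            pvalues.loc[c, r] = round(pearsonr(df[r], df[c])[1], 4)
    return pvalues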

In [None]:
p_values = calculate_pvalues(df) 
p_values

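Here is some more complicated code that shows a table of correlations annotated with significance stars (one star for each of the significance levels 0.1, 0.05, and 0.01 that the p-value passes).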

In [None]:
from scipy.stats import pearsonr
import numpy as np

significance_levels = [0.01, 0.05, 0.1]

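# correlations are only defined for numeric columns
numeric_df = df.select_dtypes("number")
rho = numeric_df.corr()
# with a callable method, pandas sets the diagonal to 1; subtract it so the diagonal p-value is 0
pval = numeric_df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(*rho.shape)
# build one star per significance level that each p-value passes
significance = pval.apply(lambda col: col.map(lambda p: ''.join('*' for t in significance_levels if p <= t)))
rho.round(2).astype(str) + significance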
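
The star notation in the cell above can be easier to follow in isolation. The sketch below applies the same rule to a few made-up p-values; the helper name `stars` is hypothetical and only used for this illustration.

```python
significance_levels = [0.01, 0.05, 0.1]


def stars(p_value):
    """Return one '*' for every significance level that p_value is at or below."""
    return "".join("*" for level in significance_levels if p_value <= level)


# Made-up p-values to show each possible outcome
for p_value in [0.005, 0.03, 0.08, 0.5]:
    print(p_value, repr(stars(p_value)))
```

A smaller p-value passes more thresholds, so it collects more stars, and a p-value above 0.1 gets none.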