In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))


<img src="img/company-logo.png" width=120 height=120 align="right">

Author: Prof. Manoel Gadi

Contact: manoelgadi@gmail.com

Teaching Web: http://mfalonso.pythonanywhere.com

Linkedin: https://www.linkedin.com/in/manoel-gadi-97821213/

Github: https://github.com/manoelgadi

Last revision: 27/October/2022



# Session 2 - Exploratory Data Analysis in Python

## Python Ecosystem
<img src="img/scipy_ecosystem.png" width=400 height=300 align="left">

## Python Pandas DataFrame
A **DataFrame** in pandas is a tabular, spreadsheet-like (Excel-like) structure with an **ordered** collection of columns, each with potentially a different type.

<img src="img/pandas_df_structure.png" width=400 height=300 align="left">


Like an Excel sheet, a DataFrame has **both** column names and row indexes, and two main accessors for selecting data:
* .iloc (integer location) accesses rows and columns by position, like a list [0,1,2,...,N-1], and
* .loc accesses rows and columns by label, like a dictionary {key:value}.
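
To make the difference concrete, here is a minimal sketch on a made-up DataFrame. The values and the row labels "a", "b", "c" are hypothetical and only serve to show that position and label can point at the same cell.

```python
import pandas as pd

# Hypothetical toy data; the row labels "a", "b", "c" are made up for illustration.
toy = pd.DataFrame(
    {"prewt": [80.5, 84.9, 81.5], "postwt": [82.2, 85.6, 81.4]},
    index=["a", "b", "c"],
)

# .iloc selects by integer position: second row (1), first column (0).
by_position = toy.iloc[1, 0]

# .loc selects by label: row labelled "b", column named "prewt".
by_label = toy.loc["b", "prewt"]

print(by_position, by_label)
```

Both expressions return the same cell here because the row labelled "b" happens to sit at position 1.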

There will be more technical details about pandas in later classes. Let's play with its basic functionality first.

## Growing with Pandas - Reading files using Pandas

Import the pandas library with an alias. Calling dir() before and after the import shows the new name pd appearing in the namespace.

In [None]:
print(dir())

In [None]:
import pandas as pd

In [None]:
print(dir())

### importing dataset using pd.read_excel() function

In [None]:
df = pd.read_excel("datasets/yahoo.xlsx")
df.head(2)

## import SAS dataset using pd.read_sas() function

In [None]:
df = pd.read_sas("datasets/DCSKINPRODUCT.sas7bdat")
df.head(2)

## import data from a SQL database

In [None]:
import sqlite3
conn = sqlite3.connect('datasets/company_balancesheet_database.db')
df = pd.read_sql("""
 SELECT *
    FROM balancesheet
""", conn)
df.head(2)
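
If the database file is not at hand, the same pattern can be tried against an in-memory SQLite database. This is only a sketch: the columns `company` and `total_assets` and the inserted rows are hypothetical, not the real schema of company_balancesheet_database.db.

```python
import sqlite3
import pandas as pd

# In-memory database, so the example leaves no file behind.
conn = sqlite3.connect(":memory:")

# Hypothetical schema and rows, for illustration only.
conn.execute("CREATE TABLE balancesheet (company TEXT, total_assets REAL)")
conn.executemany(
    "INSERT INTO balancesheet VALUES (?, ?)",
    [("A", 100.0), ("B", 250.0), ("C", 75.0)],
)
conn.commit()

# Any SELECT works here; the filtering is done by the database, not by pandas.
sql_df = pd.read_sql(
    "SELECT * FROM balancesheet WHERE total_assets > 90 ORDER BY company", conn
)
conn.close()

print(sql_df)
```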

## import data from HTML directly from a Website!

In [None]:
list_dfs = pd.read_html('https://en.wikipedia.org/wiki/Minnesota')

In [None]:
type(list_dfs)

In [None]:
list_dfs[3]

In [None]:
len(list_dfs)

In [None]:
table_MN = pd.read_html('https://en.wikipedia.org/wiki/Minnesota', match='Historical population')

In [None]:
df = table_MN[0]
df.head(2)

## import dataset using pd.read_csv() function

In [None]:
df = pd.read_csv("datasets/anorexia.csv")
df.head(2)

---
## Exercise

Import 2 datasets into 2 different dataframes:
* census_train.csv
* census_test.csv

In [None]:
import pandas as pd

In [None]:
df_train = pd.read_csv("datasets/census_train.csv")

In [None]:
df_train

In [None]:
df_test = pd.read_csv("datasets/census_test.csv")

In [None]:
df_test 

---

# Descriptive statistics

Descriptive statistics is the process of summarising and describing a sample.
It relies on measures of central tendency, measures of dispersion, the shape of the distribution, and the detection of outliers.
Visualizations can also be used to explore and make sense of the data.
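
As a rough sketch, each of these four ingredients can be computed with pandas. The weights below are made up, and the 1.5 * IQR rule used to flag outliers is one common convention chosen here as an assumption, not the only possible definition.

```python
import pandas as pd

# Hypothetical weights (in pounds); the last value is deliberately extreme.
weights = pd.Series([80.0, 82.0, 83.0, 85.0, 86.0, 120.0])

mean = weights.mean()      # central tendency
median = weights.median()  # central tendency, less sensitive to extreme values
std = weights.std()        # dispersion (sample standard deviation, ddof=1)
skew = weights.skew()      # shape: a positive value means a longer right tail

# Assumed outlier rule: values beyond 1.5 * IQR from the quartiles.
q1, q3 = weights.quantile(0.25), weights.quantile(0.75)
iqr = q3 - q1
outliers = weights[(weights < q1 - 1.5 * iqr) | (weights > q3 + 1.5 * iqr)]

print(mean, median, std, skew)
print(outliers)
```

Note how the single extreme value pulls the mean above the median, which is one reason to look at more than one measure.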




## PART 1 - Measures of central tendency

### Mean (Arithmetic Mean or Arithmetic Average)

The mean (or average) is the most popular and well-known measure of central tendency. It can be used with both discrete and continuous data, although it is used more often with continuous data. The mean is equal to the sum of all the values in the data set divided by the number of values in the data set. Therefore, if we have n values in a data set and they have values x1, x2, ..., xn, the sample mean, usually denoted by $\overline{x}$ (pronounced "x bar"), is:

\begin{equation*}
\overline{x} =
\frac{( x_1 + x_2 + ... + x_n )} {n}
\end{equation*}

This formula is usually written in a slightly different way using the Greek capital letter $\Sigma$, pronounced "sigma", which means "sum of ...":

\begin{equation*}
\overline{x} = \frac{\left( \sum_{k=1}^n x_k \right)} {n}
\end{equation*}

You may have noticed that the above formula refers to the sample mean. Why call it a sample mean? Because in statistics, samples and populations have very different meanings, and the distinction matters even though, in the case of the mean, both are calculated in the same way. To show that we are calculating the population mean and not the sample mean, we use the lowercase Greek letter "mu", written µ, and N for the number of values in the population:

\begin{equation*}
\mu = \frac{\left( \sum_{k=1}^N x_k \right)} {N}
\end{equation*}
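
As a quick check of the formula, the sketch below computes the mean of a few made-up values by hand (sum divided by count) and compares it with the pandas `mean()` method. The values are hypothetical and not taken from any dataset used in this class.

```python
import pandas as pd

# Hypothetical values x_1 ... x_n.
values = [2.0, 4.0, 6.0, 8.0]

# The formula: sum of all values divided by the number of values.
n = len(values)
manual_mean = sum(values) / n

# The same calculation done by pandas.
pandas_mean = pd.Series(values).mean()

print(manual_mean, pandas_mean)
```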

Let's calculate some means using Python for the anorexia.csv dataset.

In [2]:
# import the pandas library
import pandas as pd

In [3]:
# import dataset using pd.read_csv() function
df = pd.read_csv("datasets/anorexia.csv")

In [4]:
df.columns

Index(['ID', 'group', 'prewt', 'postwt', 'difwt'], dtype='object')

In [5]:
df.mean()

ID        197.736111
group       1.833333
prewt      82.408333
postwt     85.172222
difwt       2.763889
dtype: float64

In [6]:
df.std()

ID        76.495419
group      0.787222
prewt      5.182466
postwt     8.035173
difwt      7.983598
dtype: float64

In [8]:
df['group'].mode()

0    1
dtype: int64

In [12]:
df['group'].value_counts(normalize=True)

1    0.402778
2    0.361111
3    0.236111
Name: group, dtype: float64

__QUESTION FOR DISCUSSION__: What does the previous result mean?

Now, to continue analyzing the mode, it helps to understand our dataset a little better, in particular the __group__ column. According to https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/anorexia.html, anorexia.csv contains weight change data for young female anorexia patients.
The dataset records the weights of the patients before and after a treatment.

* Group 1: Behavioural therapy
* Group 2: Control group
* Group 3: Family therapy

The weight columns:

* prewt: the patient's weight before the study period, in pounds.
* postwt: the patient's weight after the study period, in pounds.

---

This result is easier to see if, instead of using the mode, we directly draw a __frequency table__ and a __histogram__ by group.

### Frequency table
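
A frequency table in the strict sense lists how often each category occurs. The sketch below builds one with counts and shares side by side. The `group` values are made up and are not the anorexia data.

```python
import pandas as pd

# Hypothetical group labels, for illustration only.
toy = pd.DataFrame({"group": [1, 1, 2, 3, 1, 2]})

# Absolute frequencies, sorted by group label rather than by frequency.
counts = toy["group"].value_counts().sort_index()

# Relative frequencies (shares that sum to 1).
shares = toy["group"].value_counts(normalize=True).sort_index()

# Put both side by side in one table.
freq_table = pd.DataFrame({"count": counts, "share": shares})
print(freq_table)
```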

In [24]:
df.groupby(['group']).describe().T

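| variable | statistic | group 1 | group 2 | group 3 |
|---|---|---|---|---|
| ID | count | 29.0 | 26.0 | 17.0 |
| ID | mean | 118.37931 | 213.5 | 309.0 |
| ID | std | 22.398045 | 7.648529 | 5.049752 |
| ID | min | 101.0 | 201.0 | 301.0 |
| ID | 25% | 108.0 | 207.25 | 305.0 |
| ID | 50% | 115.0 | 213.5 | 309.0 |
| ID | 75% | 122.0 | 219.75 | 313.0 |
| ID | max | 227.0 | 226.0 | 317.0 |
| prewt | count | 29.0 | 26.0 | 17.0 |
| prewt | mean | 82.689655 | 81.557692 | 83.229412 |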


In [25]:
df.groupby(['group']).describe().to_excel("results.xlsx")

In [None]:
df[['group','ID']].groupby(['group']).count()

In [22]:
df.describe()

| | ID | group | prewt | postwt | difwt |
|---|---|---|---|---|---|
| count | 72.0 | 72.0 | 72.0 | 72.0 | 72.0 |
| mean | 197.736111 | 1.833333 | 82.408333 | 85.172222 | 2.763889 |
| std | 76.495419 | 0.787222 | 5.182466 | 8.035173 | 7.983598 |
| min | 101.0 | 1.0 | 70.0 | 71.3 | -12.2 |
| 25% | 118.75 | 1.0 | 79.6 | 79.325 | -2.225 |
| 50% | 208.5 | 2.0 | 82.3 | 84.05 | 1.65 |
| 75% | 226.25 | 2.0 | 86.0 | 91.55 | 9.1 |
| max | 317.0 | 3.0 | 94.9 | 103.6 | 21.5 |


# RECAP: WHICH DATAFRAME METHODS SHOULD YOU REMEMBER AFTER THIS CLASS?

* __describe()__: df.describe() - summary statistics of numeric variables
* __value_counts()__: df.group.value_counts() - frequency table applicable for categorical variables


## NICE TO REMEMBER:
* __groupby()__: df[['group','ID']].groupby(['group']).count() - use df.groupby to do analysis by group, like in SQL:
SELECT "group", COUNT(*) FROM df GROUP BY "group";
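
The parallel with SQL can be checked on a small scale. The sketch below counts rows per group once with pandas and once with SQLite, using a made-up table. The table name `df` and the toy rows are assumptions for illustration, and `group` is quoted in the query because it is a reserved word in SQL.

```python
import sqlite3
import pandas as pd

# Hypothetical toy data, not the anorexia dataset.
toy = pd.DataFrame(
    {"ID": [101, 102, 201, 301, 302, 303], "group": [1, 1, 2, 3, 3, 3]}
)

# pandas: number of rows per group.
pandas_counts = toy.groupby("group")["ID"].count()

# SQL: the same count, run against an in-memory SQLite copy of the table.
conn = sqlite3.connect(":memory:")
toy.to_sql("df", conn, index=False)
sql_counts = pd.read_sql(
    'SELECT "group", COUNT(*) AS n FROM df GROUP BY "group" ORDER BY "group"',
    conn,
)
conn.close()

print(pandas_counts)
print(sql_counts)
```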

# What should you know so far?

* df.head()
* df.describe()
* df['column_name'].value_counts()
* df.groupby('column_name').describe()
    

Let's listen to an "AMAZING" song: <a href="https://youtu.be/1ftw05DYasM"> https://youtu.be/1ftw05DYasM </a>

<a href="https://youtu.be/1ftw05DYasM"> <img src="img/head_describe_value_counts_song.png" width=400 height=300 align="center"> </a>

---