<a href="https://colab.research.google.com/github/jmbanda/BigDataProgramming_2019/blob/master/Class15_Python_for_Data_Analysis_Part1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python for Data Analysis & Visualization (part 1)

Dataset to use:

***Salaries.csv and flights.csv found on iCollege***



Colab only code:

In [0]:
from google.colab import files
files.upload()

End of colab only code

In [0]:
#Import Python Libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt

pandas is a Python package built mainly around two data structures:
- **Series** (1d homogeneous array)
- **DataFrame** (2d labeled heterogeneous array)

Older versions also provided **Panel** (general 3d array), which was deprecated and then removed in pandas 0.25.

### Pandas Series

A pandas *Series* is a one-dimensional labeled array containing data of the same type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the *index*.

In [0]:
# Example of creating Pandas series :
s1 = pd.Series( np.random.randn(5) )
print(s1)

Because we did not pass an index, pandas assigned a default integer index ranging from 0 to len(data)-1.

In [0]:
# View index values
print(s1.index)

In [0]:
# Creating Pandas series with index:
s2 = pd.Series( np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'] )
print(s2)

In [0]:
# View index values
print(s2.index)

In [0]:
# Create a Series from dictionary
data = {'pi': 3.1415, 'e': 2.71828}  # dictionary
print(data)
s3 = pd.Series ( data )
print(s3)

In [0]:
# Reorder the elements by passing an explicit index ('tau' has no entry in the dictionary)
s4 = pd.Series ( data, index = ['e', 'pi', 'tau'])
print(s4)

NaN (not a number) is used to mark a missing value in pandas. Here the label 'tau' has no entry in the dictionary, so its value is NaN.
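As a small illustration of working with NaN, the sketch below rebuilds the same kind of Series and then detects and drops the missing entry. The variable names (`constants`, `missing_mask`, `present`) are made up for this example.

```python
import pandas as pd

# Illustrative example: 'tau' has no entry in the dictionary, so it becomes NaN.
constants = pd.Series({'pi': 3.1415, 'e': 2.71828}, index=['e', 'pi', 'tau'])

missing_mask = constants.isnull()   # True where the value is missing
present = constants.dropna()        # keep only the labels that have a value

print(missing_mask)
print(present)
```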

In [0]:
# Creating a Pandas Series object from a single number:
s5 = pd.Series( 1, index = range(10), name='Ones')
print(s5)

In [0]:
s1

In [0]:
# Many ways to "slice" Pandas series (series have zero-based index by default):
print(s1)
s1[3]  # returns 4th element

In [0]:
s1[:2] # First 2 elements


In [0]:
print( s1[ [2,1,0]])  # Elements out of order

In [0]:
#Slicing series using index label (access series like a dictionary)

s4['pi']

In [0]:
dir(s4)

In [0]:
# Series can be used as ndarray:
print("Median:" , s4.median())

In [0]:
s1[s1 > 0]

In [0]:
# Boolean masks can be built from Series methods (keep values above the median):
s4[s4 > s4.median()]

In [0]:
# vector operations:
np.exp(s1)

In [0]:
# Unlike an ndarray, a Series automatically aligns data based on the label:
s5 = pd.Series (range(6))
print(s5)
s5[1:] + s5[:-1]
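
To make the alignment behaviour concrete, here is a minimal sketch contrasting label-based addition of two Series with position-based addition of their underlying NumPy arrays. The names `aligned` and `positional` are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series(range(6))

# Series addition matches values by label; labels 0 and 5 appear in only
# one operand, so the result is NaN there.
aligned = s[1:] + s[:-1]

# NumPy arrays carry no labels, so addition is purely positional.
positional = s[1:].to_numpy() + s[:-1].to_numpy()

print(aligned)
print(positional)
```

If positional behaviour is what you want, converting to arrays first (as above) is one way to get it.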

#### Popular Attributes and Methods:

|  Attribute/Method | Description |
|-----|-----|
| dtype | data type of values in series |
| empty | True if series is empty |
| size | number of elements |
| values | Returns values as ndarray |
| head() | First n elements |
| tail() | Last n elements |
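
The attributes and methods in the table can be tried on any Series. The sketch below uses a small Series of made-up values named `temps`.

```python
import numpy as np
import pandas as pd

temps = pd.Series([21.5, 19.0, 23.2, 22.1, 20.4, 18.9], name='temps')  # made-up values

print(temps.dtype)     # data type of the values
print(temps.empty)     # whether the Series has no elements
print(temps.size)      # number of elements
print(temps.values)    # underlying NumPy array
print(temps.head(2))   # first 2 elements
print(temps.tail(2))   # last 2 elements
```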

*Complete exercise 1 in In Class Assignment notebook*

### Pandas DataFrame

A pandas *DataFrame* is a two-dimensional, size-mutable, heterogeneous tabular data structure with labeled rows and columns (axes). It can be thought of as a dictionary-like container of Series objects.

In [0]:
d =  pd.DataFrame({ 'Name': pd.Series(['Alice','Bob','Chris']), 
                  'Age': pd.Series([ 21,25,23]) } )
print(d)

In [0]:
#Add a new column:
d['height'] = pd.Series([5.2,6.0,5.6])
d

In [0]:
#Read csv file
df = pd.read_csv("Salaries.csv")

In [0]:
#Display a few first records
df.head(5)

---
*Exercise* 

*Complete exercise 2 in In Class Assignment notebook*

---

In [0]:
#Identify the type of df object
type(df)

In [0]:
#Check the type of a column "salary"
df['salary'].dtype

In [0]:
#List the types of all columns
df.dtypes

In [0]:
#List the column names
df.columns

In [0]:
#List the row labels and the column names
df.axes

In [0]:
#Number of dimensions
df.ndim

In [0]:
#Total number of elements in the Data Frame
df.size

In [0]:
#Number of rows and columns
df.shape

In [0]:
#Output basic statistics for the numeric columns
df.describe()

In [0]:
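#Calculate mean for all numeric columns
# (numeric_only=True is required in pandas 2.0+, where non-numeric columns raise a TypeError)
df.mean(numeric_only=True)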

---
*Complete exercise 3 in In Class Assignment notebook*

---
### Data slicing and grouping

In [0]:
#Extract a column by name (method 1)
df['sex'].head()

---

In [0]:
#Group data using rank
df_rank = df.groupby('rank')

In [0]:
#Calculate mean of all numeric columns for the grouped object
df_rank.mean(numeric_only=True)

In [0]:
df.groupby('sex').mean(numeric_only=True)

In [0]:
#Calculate the mean salary for men and women. The following produces a pandas Series (single brackets around salary)
df.groupby('sex')['salary'].mean()

In [0]:
# If we use double brackets Pandas will produce a DataFrame
df.groupby('sex')[['salary']].mean()

In [0]:
# Group using 2 variables - sex and rank:
df.groupby(['rank','sex'], sort=True)[['salary']].mean()
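
The single-bracket versus double-bracket distinction can be checked on a tiny frame. This is a sketch with made-up data; `staff` and its values are not taken from Salaries.csv.

```python
import pandas as pd

# Made-up data for illustration only.
staff = pd.DataFrame({
    'sex': ['Female', 'Male', 'Female', 'Male'],
    'salary': [100, 80, 120, 90],
})

as_series = staff.groupby('sex')['salary'].mean()    # single brackets -> Series
as_frame = staff.groupby('sex')[['salary']].mean()   # double brackets -> DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
```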

---
*Complete exercise 4 in In Class Assignment notebook*


---
### Filtering

In [0]:
#Select observations where the value in the salary column is > 120K
df_sub = df[ df['salary'] > 120000]
df_sub.head()

In [0]:
df_sub.axes

In [0]:
#Select data for female professors
df_w = df[ df['sex'] == 'Female']
df_w.head()

In [0]:
# Using filtering, find the mean salary for discipline A
round(df[df['discipline'] == 'A']['salary'].mean(), 2)
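
Filters can also be combined. The sketch below, on a made-up frame named `faculty`, uses `&` (and) and `|` (or); each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

# Made-up data for illustration only.
faculty = pd.DataFrame({
    'sex': ['Female', 'Male', 'Female', 'Male'],
    'discipline': ['A', 'B', 'B', 'A'],
    'salary': [120, 130, 90, 80],
})

# Rows that satisfy both conditions.
both = faculty[(faculty['sex'] == 'Female') & (faculty['salary'] > 100)]

# Rows that satisfy at least one condition.
either = faculty[(faculty['discipline'] == 'A') | (faculty['salary'] > 100)]

print(both)
print(either)
```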


---
### More on slicing the dataset

In [0]:
#Select column salary
df1 = df['salary']

In [0]:
#Check data type of the result
type(df1)

In [0]:
#Look at the first few elements of the output
df1.head()

In [0]:
#Select column salary and make the output to be a data frame
df2 = df[['salary']]

In [0]:
#Check the type
type(df2)

In [0]:
#Select a subset of rows (based on their position):
# Note 1: The location of the first row is 0
# Note 2: The last value in the range is not included
df[0:10]

In [0]:
#If we want to select both rows and columns we can use method .loc
df.loc[10:20,['rank', 'sex','salary']]

In [0]:
df_sub.head(15)

In [0]:
#Let's see what we get for our df_sub data frame
# Method .loc subset the data frame based on the labels:
df_sub.loc[10:20,['rank','sex','salary']]

In [0]:
# Unlike method .loc, method .iloc selects rows (and columns) by position:
df_sub.iloc[10:20, [0,3,4,5]]
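
The difference between `.loc` and `.iloc` is easiest to see on a filtered frame, whose row labels are no longer consecutive. The sketch below uses made-up data (`people`, `high`).

```python
import pandas as pd

# Made-up data for illustration only.
people = pd.DataFrame({'name': list('abcdef'), 'score': [5, 50, 7, 70, 9, 90]})
high = people[people['score'] > 20]   # keeps the original row labels 1, 3, 5

by_label = high.loc[1:3, ['name', 'score']]   # labels 1 through 3, end included
by_position = high.iloc[1:3, [0, 1]]          # positions 1 and 2, end excluded

print(by_label)
print(by_position)
```

Note that the same `1:3` slice selects different rows in the two calls.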

### Sorting the Data

In [0]:
#Sort the data frame by the service column and create a new data frame
df_sorted = df.sort_values(by = 'service')
df_sorted.head()

In [0]:
#Sort the data frame by the service column (descending) and overwrite the original dataset
df.sort_values(by = 'service', ascending = False, inplace = True)
df.head()

In [0]:
# Restore the original order (by sorting using index)
df.sort_index(axis=0, ascending = True, inplace = True)
df.head()

In [0]:
#Sort the data frame using 2 or more columns:
df_sorted = df.sort_values(by = ['service', 'salary'], ascending = [True,False])
df_sorted.head(10)

### Missing Values - using the flights dataset

In [0]:
# Read a dataset with missing values
flights = pd.read_csv("flights.csv")
flights.head()

In [0]:
# Select the rows that have at least one missing value
flights[flights.isnull().any(axis=1)].head()

In [0]:
# Filter all the rows where arr_delay value is not missing:
flights1 = flights[ flights['arr_delay'].notnull( )]
flights1.head()

In [0]:
# Remove all the observations with missing values
flights2 = flights.dropna()

In [0]:
# Fill missing values with zeros
nomiss =flights['dep_delay'].fillna(0)
nomiss.isnull().any()
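
The choice between dropping and filling missing values affects later statistics. Here is a small sketch on a made-up frame (`trips`) that reuses the column names from the flights data.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
trips = pd.DataFrame({'dep_delay': [5.0, np.nan, -2.0],
                      'arr_delay': [7.0, 3.0, np.nan]})

rows_with_gap = trips[trips.isnull().any(axis=1)]   # rows with at least one NaN
complete_rows = trips.dropna()                      # rows with no NaN at all
filled = trips['dep_delay'].fillna(0)               # NaN replaced by 0

# mean() skips NaN by default, so filling with 0 changes the result.
print(trips['dep_delay'].mean())
print(filled.mean())
```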

---
*Complete exercise 5 in In Class Assignment notebook*


---
### Common Aggregation Functions:

| Function | Description |
|-------|--------|
| min | minimum |
| max | maximum |
| count | number of non-null observations |
| sum | sum of values |
| mean | arithmetic mean of values |
| median | median |
| mad | mean absolute deviation (removed in pandas 2.0) |
| mode | mode |
| prod | product of values |
| std | standard deviation |
| var | unbiased variance |
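
Several of these functions can be applied at once with `agg()`. Because the `mad()` method is not available in pandas 2.0 and later, the sketch below also computes the mean absolute deviation by hand. The Series `delays` holds made-up values.

```python
import pandas as pd

delays = pd.Series([2.0, 4.0, 6.0, 8.0])   # made-up values

# Apply several aggregation functions by name in one call.
summary = delays.agg(['min', 'max', 'count', 'sum', 'mean', 'median', 'std', 'var'])

# Mean absolute deviation: average distance of each value from the mean.
mean_abs_dev = (delays - delays.mean()).abs().mean()

print(summary)
print(mean_abs_dev)
```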



In [0]:
# describe() reports the count of non-missing values (plus other statistics) for each numeric column
flights.describe()

In [0]:
# Find the minimum value for each column in the dataset
flights.min()

In [0]:
# Let's compute a summary statistic per group:
flights.groupby('carrier')['dep_delay'].mean()

In [0]:
# We can use agg() methods for aggregation:
flights[['dep_delay','arr_delay']].agg(['min','mean','max'])

In [0]:
# An example of computing different statistics for different columns
flights.agg({'dep_delay': ['min', 'mean', 'max'], 'carrier': ['nunique']})

### Basic descriptive statistics

| Function | Description |
|-------|--------|
| min | minimum |
| max | maximum |
| mean | arithmetic mean of values |
| median | median |
| mad | mean absolute deviation (removed in pandas 2.0) |
| mode | mode |
| std | standard deviation |
| var | unbiased variance |
| sem | standard error of the mean |
| skew | sample skewness |
| kurt | kurtosis |
| quantile | value at a given quantile (e.g. 0.25 for the 25th percentile) |
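
A few of the less familiar functions in this table are sketched below on a made-up Series (`values`) with one large value, so the distribution is skewed to the right.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])   # made-up values

std_error = values.sem()                        # standard error of the mean
quartiles = values.quantile([0.25, 0.5, 0.75])  # values at the given quantiles
skewness = values.skew()                        # positive for a long right tail
kurtosis = values.kurt()

print(std_error, skewness, kurtosis)
print(quartiles)
```

The standard error equals the standard deviation divided by the square root of the number of observations, which can be checked with `values.std() / np.sqrt(values.size)`.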


In [0]:
# The convenient describe() function computes a variety of statistics
flights.dep_delay.describe()

In [0]:
# find the index of the maximum or minimum value
# if there are multiple values matching idxmin() and idxmax() will return the first match
flights['dep_delay'].idxmin()  # index label of the minimum value
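
The first-match behaviour can be seen on a Series with ties. This sketch uses made-up delays and labels, and also shows one way to recover every tied label.

```python
import pandas as pd

# Made-up values: the minimum (-5) and the maximum (12) each occur twice.
delays = pd.Series([3, -5, 12, -5, 12], index=['f1', 'f2', 'f3', 'f4', 'f5'])

first_min = delays.idxmin()   # label of the first occurrence of the minimum
first_max = delays.idxmax()   # label of the first occurrence of the maximum

# A Boolean mask returns all labels that share the minimum value.
all_min = delays[delays == delays.min()].index.tolist()

print(first_min, first_max, all_min)
```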

In [0]:
# Count the number of records for each different value in a vector
flights['carrier'].value_counts()

### Method chaining

Method chaining with a DataFrame means calling several DataFrame methods in sequence, each one operating on the DataFrame returned by the previous call. The main reasons to chain methods are to avoid storing intermediate variables and to avoid repeated reassignment like the following:

In [0]:
dfmc = pd.DataFrame({'a_column': [1, -999, -999],
                    'powerless_column': [2, 3, 4],
                    'int_column': [1, 1, -1]})
dfmc['a_column'] = dfmc['a_column'].replace(-999, np.nan)
dfmc['power_column'] = dfmc['powerless_column'] ** 2
dfmc['real_column'] = dfmc['int_column'].astype(np.float64)
dfmc = dfmc.apply(lambda _df: _df.replace(4, np.nan))
dfmc = dfmc.dropna(how='all')

Using chaining:

In [0]:
df2 = (pd.DataFrame({'a_column': [1, -999, -999],
                    'powerless_column': [2, 3, 4],
                    'int_column': [1, 1, -1]})
        .assign(a_column=lambda _df: _df['a_column'].replace(-999, np.nan),
                power_column=lambda _df: _df['powerless_column'] ** 2,
                real_column=lambda _df: _df['int_column'].astype(np.float64))
        .apply(lambda _df: _df.replace(4, np.nan))
        .dropna(how='all')
      )

**A lot nicer right?**

![alt text](https://miro.medium.com/max/500/1*RKYXAT6UWwb4y0X2LBXPSA.jpeg)
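
As a quick check that the two styles are interchangeable, the sketch below runs a stepwise and a chained version of the same small pipeline and compares the results. The frame `raw` and its column names are made up for this example.

```python
import numpy as np
import pandas as pd

# Made-up data: -999.0 stands in for a missing reading.
raw = pd.DataFrame({'reading': [1.0, -999.0, 3.0], 'visits': [2, 3, 4]})

# Stepwise version: repeated assignment on a working copy.
stepwise = raw.copy()
stepwise['reading'] = stepwise['reading'].replace(-999.0, np.nan)
stepwise['visits_squared'] = stepwise['visits'] ** 2
stepwise = stepwise.dropna()

# Chained version: each method returns a new DataFrame for the next call.
chained = (
    raw
    .assign(reading=lambda d: d['reading'].replace(-999.0, np.nan),
            visits_squared=lambda d: d['visits'] ** 2)
    .dropna()
)

print(chained.equals(stepwise))
```

The chained version leaves `raw` untouched, which makes it easier to rerun a cell without side effects.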

---
## Basic statistical Analysis

### Linear Regression

In [0]:
# Import the statsmodels formula API:
import statsmodels.formula.api as smf

In [0]:
# create a fitted model
lm = smf.ols(formula='salary ~ service', data=df).fit()

#print model summary
print(lm.summary())

In [0]:
# print the coefficients
lm.params

In [0]:
#using scikit-learn:
from sklearn import linear_model
est = linear_model.LinearRegression(fit_intercept = True)   # create estimator object
est.fit(df[['service']], df[['salary']])

#print result
print("Coef:", est.coef_, "\nIntercept:", est.intercept_)
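
To see what a one-predictor linear fit computes, here is a minimal NumPy-only sketch on made-up data that lies exactly on a line. It uses `np.polyfit` and the closed-form least-squares slope rather than statsmodels or scikit-learn; with a single predictor, ordinary least squares should recover the same slope and intercept.

```python
import numpy as np

# Made-up data lying exactly on the line pay = 1 + 2 * years.
years = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
pay = 1.0 + 2.0 * years

# Degree-1 least-squares fit; coefficients come back highest degree first.
slope, intercept = np.polyfit(years, pay, deg=1)

# Closed-form slope: covariance of x and y divided by the variance of x.
x_dev = years - years.mean()
y_dev = pay - pay.mean()
slope_closed_form = (x_dev * y_dev).sum() / (x_dev ** 2).sum()
intercept_closed_form = pay.mean() - slope_closed_form * years.mean()

print(slope, intercept)
print(slope_closed_form, intercept_closed_form)
```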


---
Example 2 

In [0]:
# Build a linear model for arr_delay ~ dep_delay


#print model summary


---
### Student's t-test

The t-test tells you how significant the differences between groups are. In other words, it lets you know whether those differences (measured in means/averages) could have happened by chance.

In [0]:
# Using scipy package:
from scipy import stats
df_w = df[ df['sex'] == 'Female']['salary']
df_m = df[ df['sex'] == 'Male']['salary']
stats.ttest_ind(df_w, df_m)   

Ttest_indResult(statistic=-2.2486865976699053, pvalue=0.027429778657910103)

The t-score is the ratio of the difference between two groups to the difference within the groups. The larger the t-score in absolute value, the more difference there is between groups; its sign only indicates which group has the larger mean. The smaller the t-score, the more similarity there is between groups. A t-score of 3 means that the groups are three times as different from each other as they are within each other. When you run a t-test, the bigger the absolute t-value, the more likely it is that the results are repeatable.

A large t-score tells you that the groups are different.

A small t-score tells you that the groups are similar.
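
The ratio described above can be computed by hand. The sketch below uses NumPy on two made-up samples and the pooled-variance (equal variances) form of the statistic, which is the form `stats.ttest_ind` computes with its default `equal_var=True`.

```python
import numpy as np

# Made-up samples for illustration only.
group_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
group_b = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
n_a, n_b = len(group_a), len(group_b)

# Difference between the groups: the gap between the two sample means.
between = group_a.mean() - group_b.mean()

# Difference within the groups: pooled variance scaled by the sample sizes.
pooled_var = ((n_a - 1) * group_a.var(ddof=1)
              + (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)
within = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

t_score = between / within
print(t_score)
```

The negative sign here only reflects that the first group has the smaller mean.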

How big is “big enough”? Every t-value has a p-value to go with it. A p-value is the probability of seeing a difference at least as large as the one in your sample if there were really no difference between the groups. P-values range from 0% to 100% and are usually written as decimals; for example, a p-value of 5% is 0.05. Low p-values indicate that the observed difference would be unlikely to arise by chance alone. For example, a p-value of .01 means there is only a 1% probability of a result this extreme when the groups do not actually differ. By convention, a p-value below 0.05 (5%) is treated as statistically significant. The test above returned a p-value of about 0.027, so the salary difference between the two groups is significant at the 5% level.

Sources:

http://rcs.bu.edu/examples/python/data_analysis

https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/t-test/