# Pandas, Matplotlib and Scikit Learn Overview

## What is pandas?
Python library for data manipulation and analysis
 - Provides expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.
 - Aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.
 - Built on top of NumPy and intended to integrate well within a scientific computing environment.
 - Inspired by R and Excel.
 
Pandas is well suited for many different kinds of data:
- **Tabular data** with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) **time series data**.
- **Arbitrary matrix data** (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational / statistical data sets (can be unlabeled)

Two primary data structures
- **Series** (1-dimensional) – Similar to a column in an Excel spreadsheet
- **DataFrame** (2-dimensional) – Similar to R’s data frame

A few of the things that Pandas does well
- Easy handling of **missing data** (represented as NaN)
- Automatic and explicit **data alignment**
- Easy reading and analysis of **CSV** files and Excel sheets
- Operations: filtering, group by, merging, slicing and dicing, pivoting and reshaping
- Plotting graphs
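
Missing data and alignment from the list above can be seen in a few lines. The following is a minimal sketch on made-up values (the labels `x`, `y`, `z` and the numbers are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Two Series with partially overlapping labels (hypothetical data)
a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10.0, 20.0], index=['y', 'z'])

# Arithmetic aligns on the index; 'x' has no partner in b, so the result is NaN there
total = a + b

# Reductions skip NaN by default
total_sum = total.sum()

# Explicit alignment with a fill value instead of propagating NaN
filled = a.add(b, fill_value=0)

print(total)
print(filled)
```

Whether to propagate NaN or fill it depends on what a missing label means in your data; `fill_value=0` is only one possible choice.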

Pandas is very useful for interactive data exploration at the data preparation stage of a project.

The official guide to Pandas can be found [here](http://pandas-docs.github.io/pandas-docs-travis/10min.html)

## Pandas Objects

In [None]:
import pandas as pd
import numpy as np

A **Series** is like a column in a spreadsheet.

In [None]:
s = pd.Series([1,3,np.nan,'string'])
s

A **DataFrame** is like a spreadsheet: a dictionary of Series objects.

In [None]:
data = [['ABC', -3.5, 0.01], ['ABC', -2.3, 0.12], ['DEF', 1.8, 0.03],
['DEF', 3.7, 0.01], ['GHI', 0.04, 0.43], ['GHI', -0.1, 0.67]]

df = pd.DataFrame(data, columns=['gene', 'log2FC', 'pval'])

df
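
To make the "dictionary of Series" view concrete, here is a small sketch that builds a frame from two Series and pulls one back out. The values reuse the style of the example above; the variable names are illustrative assumptions.

```python
import pandas as pd

# Each Series becomes one column; the dictionary key becomes the column name
genes = pd.Series(['ABC', 'DEF', 'GHI'])
log2fc = pd.Series([-3.5, 1.8, 0.04])

frame = pd.DataFrame({'gene': genes, 'log2FC': log2fc})

# Selecting a single column returns a Series again
column = frame['log2FC']

print(type(column).__name__)
print(frame.dtypes)
```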

### Viewing Data

Display the top and bottom rows of the frame

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.tail(2)

Display a single column, the column labels, and the underlying NumPy data

In [None]:
df['log2FC'] # df.log2FC

In [None]:
df.columns

In [None]:
df.values

### Summarizations
`describe` shows a quick statistical summary of your data

In [None]:
df.describe()

`info` provides a concise summary of a DataFrame

In [None]:
df.info()

## Input and Output
How do you get data into and out of Pandas as spreadsheets?
 - Pandas can work with XLS or XLSX files.
 - Can also work with CSV (comma-separated values) files
 - CSV stores plain text in a tabular form
 - CSV files may have a header
 - You can use a variety of different field delimiters (rather than a ‘comma’). Check which delimiter your file is using before import!
 
__Import to Pandas__  
 > `df = pd.read_csv('data.csv', sep='\t', header=0)`

For Excel files, use `read_excel` in the same way.

__Export to text file__  
 > `df.to_csv('data.csv', sep='\t', header=True, index=False)`
 
The values of `header` and `index` depend on whether you want to write the column and/or row names.
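
As a quick check of the import and export calls above, the sketch below writes a small frame to an in-memory text buffer and reads it back. Using `io.StringIO` instead of a file on disk is an assumption made so the example runs anywhere; the frame contents are made up.

```python
import io
import pandas as pd

frame = pd.DataFrame({'gene': ['ABC', 'DEF'], 'pval': [0.01, 0.03]})

# Write tab-separated text with a header row and without the row index
buffer = io.StringIO()
frame.to_csv(buffer, sep='\t', header=True, index=False)

# Rewind the buffer and read it back with the same delimiter
buffer.seek(0)
restored = pd.read_csv(buffer, sep='\t', header=0)

print(restored)
```

If the delimiter passed to `read_csv` does not match the one used when writing, the whole row typically ends up in a single column, which is an easy symptom to look for.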

# Case Study – Analyzing Titanic Passengers Data

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import os


#set your working_dir
working_dir = os.path.join(os.getcwd(), 'titanic')

url_base = 'https://webcourse.cs.technion.ac.il/236756/Spring2018/ho/WCFiles/{}.csv'
train_url = url_base.format('train')
test_url = url_base.format('test')

# For .read_csv, always use header=0 when you know row 0 is the header row
train = pd.read_csv(train_url, header=0)
test = pd.read_csv(test_url, header=0)
# You can also load a csv file from a local file rather than a URL


In [None]:
# Display the top elements from the table
train.head()

#### VARIABLE DESCRIPTIONS:
survival - 0 = No; 1 = Yes  
pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)  
sibsp - Number of Siblings/Spouses Aboard  
parch - Number of Parents/Children Aboard  
ticket - Ticket Number  
fare - Passenger Fare  
embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)  

## Understanding the data...

In [None]:
train.shape

In [None]:
train.Pclass.size

In [None]:
# Count values of 'Survived'
train.Survived.value_counts()

In [None]:
# Calculate the mean fare price
train.Fare.mean()

In [None]:
# General statistics of the dataframe
train.describe()

### Selection examples

In [None]:
# Selection is very similar to standard Python selection
df1 = train[['Name', 'Sex', 'Age']]
df1.head()

In [None]:
df1[10:15]

In [None]:
df1[-4:-10:-2]

### Filtering Examples

In [None]:
# Filtering allows you to create masks given some conditions
df1.Sex == 'male' 

In [None]:
df1[(df1.Age > 35) & (df1.Sex == 'male')]
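
The mask logic can be tried on a tiny frame without downloading anything. This is a sketch on invented rows (names and ages are assumptions); note that each comparison needs its own parentheses because `&` binds more tightly than `>` and `==`.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Sex': ['male', 'female', 'male', 'male'],
    'Age': [40.0, 50.0, 22.0, None],   # None becomes NaN
})

# Element-wise AND of two boolean Series; comparisons with NaN are False
mask = (people.Age > 35) & (people.Sex == 'male')

# Index with the mask to keep matching rows
selected = people[mask]

# .loc accepts the same mask and can pick columns at the same time
names = people.loc[mask, 'Name']

print(selected)
```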

### Creating new columns/Data
We may be interested in creating new features that are derived from other columns.
In this example we create a 'prefix' (or passenger's title).

In [None]:
import re

prefix = []
for name in train.Name:
    # search for ', M' followed by 1 or more lowercase letters and a literal '.'
    m = re.search(r', M[a-z]+\.', name)
    # keep the title without the leading ', '; use "" when nothing matches
    prefix.append(m.group(0).strip(", ") if m else "")

In [None]:
train['Prefix'] = prefix
train.head()
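
The same idea can also be written without an explicit loop by using the pandas string accessor. The sketch below is one possible alternative, not the notebook's method; it runs on a handful of sample names in the "Surname, Title. Given names" format and returns the title without the trailing dot.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Cumings, Mrs. John Bradley',
    'Rice, Master. Eugene',
    'Uruchurtu, Don. Manuel E',   # does not start with 'M', so no match
])

# The capture group keeps only the title; rows without a match become NaN
titles = names.str.extract(r', (M[a-z]+)\.', expand=False)

# Replace missing matches with an empty string, as in the loop version
titles = titles.fillna('')

print(titles)
```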

## Other operations
Pandas allows you to aggregate and display different views of your data.

In [None]:
df2 = train.groupby(['Pclass', 'Sex']).Fare.agg('mean')
df2

In [None]:
pd.pivot_table(train, index=['Pclass'], values=['Survived'], aggfunc='count')

In [None]:
pd.pivot_table(train, index=['Pclass', 'Sex'], values=['Age', 'Fare', 'Survived'], aggfunc='mean')
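
`groupby` and `pivot_table` are closely related: a pivot table is a grouped aggregation laid out with one grouping key as columns. A small sketch on invented fares (the numbers are assumptions) shows the two giving the same values:

```python
import pandas as pd

toy = pd.DataFrame({
    'Pclass': [1, 1, 3, 3, 3],
    'Sex': ['female', 'male', 'female', 'male', 'male'],
    'Fare': [80.0, 60.0, 10.0, 8.0, 12.0],
})

# Long format: one row per (Pclass, Sex) pair
by_group = toy.groupby(['Pclass', 'Sex'])['Fare'].mean()

# Move the 'Sex' level into columns to get a wide table
wide = by_group.unstack('Sex')

# The same table produced directly by pivot_table
pivot = pd.pivot_table(toy, index='Pclass', columns='Sex',
                       values='Fare', aggfunc='mean')

print(wide)
print(pivot)
```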

## Plotting
Basic plotting in pandas is pretty straightforward

In [None]:
new_plot = pd.crosstab([train.Pclass, train.Sex], train.Survived)
new_plot.plot(kind='bar', stacked=True, color=['red', 'blue'], grid=False)

In [None]:
train.Fare.hist()

## Concluding Remarks

* A DataFrame is not itself a NumPy array.  
When a library expects a plain array, convert the DataFrame first (`as_matrix()` was removed in pandas 1.0):  
  > `pandas.DataFrame.to_numpy()` (or the `.values` attribute)

* [Categorical type](http://pandas.pydata.org/pandas-docs/stable/categorical.html)  
 * Similar to ‘Factor’ in R but can preserve an order.

 * CSV file – Categorical variables are read as type ‘object’ and
need to be transformed to categorical type
 * Converting categorical to numeric (usually int) type
 
 > * `Categorical.rename_categories()`  
    Convert categorical variable into int variable
  
 > * **`pandas.get_dummies()`**  
    Convert categorical variable into dummy/indicator variables
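
A short sketch of the two conversions on a made-up port column follows. It uses the `.cat.codes` accessor for the integer form, which is an alternative to renaming the categories and is swapped in here as an assumption because it is compact.

```python
import pandas as pd

# Object strings converted to the categorical type (categories are sorted: C, Q, S)
ports = pd.Series(['S', 'C', 'Q', 'S'], dtype='category')

# Integer form: each value is replaced by the position of its category
codes = ports.cat.codes

# Dummy/indicator form: one column per category
dummies = pd.get_dummies(ports)

print(codes.tolist())
print(dummies)
```

Integer codes impose an ordering on the categories, so indicator columns are often the safer choice for nominal variables.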

# What is Matplotlib?

A 2D plotting library which produces publication-quality figures.
 - Can be used in Python scripts, the Python and IPython shells, web application servers, and more …
 - Can be used to generate plots, histograms, power spectra, bar charts, errorcharts, scatterplots, etc.
 - For simple plotting, pyplot provides a MATLAB-like interface 
 - For power users, full control via an OO interface or via a set of functions

There are several Matplotlib add-on toolkits
 - Projection and mapping toolkits [basemap](http://matplotlib.org/basemap/) and [cartopy](http://scitools.org.uk/cartopy/).
 - Interactive plots in web browsers using [Bokeh](http://bokeh.pydata.org/en/latest/).
 - Higher level interface with updated visualizations [Seaborn](http://seaborn.pydata.org/index.html).

Matplotlib is available at [www.matplotlib.org](https://matplotlib.org)

## Line Plots

### plot against indices

In [None]:
import matplotlib.pyplot as plt
import numpy as np

In [None]:
x = np.arange(50) * 2*np.pi / 50
y = np.sin(x)
plt.plot(y)
plt.xlabel('index')
plt.show()

## multiple lines

In [None]:
x2 = np.arange(50) * 2*np.pi / 25
y2 = np.sin(x2)
plt.plot(x, y, x2, y2)

In [None]:
plt.plot(x, np.sin(x), 'r-^')

In [None]:
plt.plot(x, y, 'b-o', x2, y2, 'r-^')
plt.axis([0, 7, -2, 2])

## Scatter plots

In [None]:
plt.scatter(x, y)

## colormapped scatter

In [None]:
x_rand = np.random.rand(200)
y_rand = np.random.rand(200)
size = np.random.rand(200)*30
color = np.random.rand(200)
plt.scatter(x_rand, y_rand, size, color)
plt.colorbar()

## Bar plots

In [None]:
plt.bar(x, y)

In [None]:
plt.barh(x, y, height=x[1]-x[0])

## Histogram

In [None]:
plt.hist(np.random.randn(1000))

In [None]:
plt.hist(np.random.randn(1000), 30)

## Subplots

In [None]:
import numpy as np
import matplotlib.pyplot as plt


x1 = np.linspace(0.0, 5.0)
x2 = np.linspace(0.0, 2.0)

y1 = np.cos(2 * np.pi * x1) * np.exp(-x1)
y2 = np.cos(2 * np.pi * x2)

plt.subplot(2, 1, 1)
plt.plot(x1, y1, 'ko-')
plt.title('A tale of 2 subplots')
plt.ylabel('Damped oscillation')

plt.subplot(2, 1, 2)
plt.plot(x2, y2, 'r.-')
plt.xlabel('time (s)')
plt.ylabel('Undamped')

plt.show()
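
The same two-panel figure can also be built through the object-oriented interface mentioned earlier, where every call is made on an explicit axes object rather than on the "current" subplot. This is a sketch of that style using the same data; the names `ax_top` and `ax_bottom` are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

x1 = np.linspace(0.0, 5.0)
x2 = np.linspace(0.0, 2.0)

# One figure with two stacked axes, returned as explicit objects
fig, (ax_top, ax_bottom) = plt.subplots(2, 1)

ax_top.plot(x1, np.cos(2 * np.pi * x1) * np.exp(-x1), 'ko-')
ax_top.set_title('A tale of 2 subplots')
ax_top.set_ylabel('Damped oscillation')

ax_bottom.plot(x2, np.cos(2 * np.pi * x2), 'r.-')
ax_bottom.set_xlabel('time (s)')
ax_bottom.set_ylabel('Undamped')

# Adjust spacing so labels and titles do not overlap
fig.tight_layout()
```

In a notebook with inline plotting the figure is displayed at the end of the cell; in a script you would typically call `plt.show()` or `fig.savefig(...)` afterwards.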

## 3d plot

In [None]:
from numpy import *
import pylab as p
import mpl_toolkits.mplot3d.axes3d as p3

# u and v are parametric variables.
# u is an array from 0 to 2*pi, with 100 elements
u=r_[0:2*pi:100j]
# v is an array from 0 to pi, with 100 elements
v=r_[0:pi:100j]
# x, y, and z are the coordinates of the points for plotting
# each is arranged in a 100x100 array
x=10*outer(cos(u),sin(v))
y=10*outer(sin(u),sin(v))
z=10*outer(ones(size(u)),cos(v))

fig=p.figure()
ax = fig.add_subplot(projection='3d')  # Axes3D(fig) no longer attaches itself to the figure in recent Matplotlib
ax.plot_wireframe(x,y,z)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
p.show()

### Bokeh example

In [None]:
train.Fare.hist()

# What is Scikit Learn?

Scikit-learn is an open source machine learning library for Python.  
 - **Simple and efficient** tools for data mining and data analysis
 - Good coverage of machine learning algorithms, processes, tools and techniques
   - Classification, Regression, Clustering, Dimensionality Reduction, Model selection, Preprocessing
 - **High standards** 
 - Well-suited for applications:
   - Used for **large datasets**
   - **Building blocks** for application-specific algorithms
 - Built on **NumPy, SciPy**
 - **Open source**, Commercially usable - BSD license, Community driven

Data Representation in Scikit-learn:
 - Most algorithms expect a two-dimensional array, of shape (n_samples,n_features).
 - The arrays can be either NumPy arrays, or in some cases scipy.sparse matrices.
   - The number of features must be fixed in advance.
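
In practice this means a single column usually has to be reshaped before it is passed to an estimator. A minimal NumPy sketch with made-up ages and fares:

```python
import numpy as np

# A single feature stored as a 1-D array has shape (n_samples,)
ages = np.array([22.0, 38.0, 26.0, 35.0])

# reshape(-1, 1) turns it into (n_samples, 1): four samples, one feature
X_single = ages.reshape(-1, 1)

# Several 1-D features can be stacked as columns into (n_samples, n_features)
fares = np.array([7.25, 71.28, 7.92, 53.1])
X = np.column_stack([ages, fares])

print(X_single.shape, X.shape)
```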

Design principles
 - Minimize number of object interfaces
 - Build abstractions for recurrent use cases
 - Simplicity, Simplicity, Simplicity
 
Code samples:
> `from sklearn import svm`  
> `clf = svm.SVC()`  
> `clf.fit(X_train, y_train)`  
> `y_pred = clf.predict(X_test)`

Classification:
>``y_pred = model.predict(X_test)``

Filters, dimension reduction, latent variables:
>``X_new = model.transform(X_test)``

Incremental learning:
>``model.partial_fit(X_train, y_train)``
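
The snippets above can be combined into one runnable sketch. The data here is random and the choice of `StandardScaler` and `SGDClassifier` is an assumption made only to have one transformer and one estimator that supports `partial_fit`; any estimator with the same methods follows the same pattern.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data: 40 training and 10 test samples with 2 features
rng = np.random.RandomState(0)
X_train = rng.randn(40, 2)
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.randn(10, 2)

# Transformer: learn statistics on the training data, then apply them
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Classifier: fit, then predict
clf = SVC(gamma='auto').fit(X_train_s, y_train)
y_pred = clf.predict(X_test_s)

# Incremental learner: feed the data in batches of 10;
# partial_fit needs the full set of labels on the first call
inc = SGDClassifier(random_state=0)
for start in range(0, 40, 10):
    batch = slice(start, start + 10)
    inc.partial_fit(X_train_s[batch], y_train[batch], classes=np.array([0, 1]))
y_pred_inc = inc.predict(X_test_s)

print(y_pred, y_pred_inc)
```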

---

The [scikit-learn website](http://scikit-learn.org/stable/) has great tutorials for using the library.

---
The [preprocessing](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing) page has information that is very relevant for the second exercise.

A more interactive tutorial introducing scikit-learn can be found [here](https://www.datacamp.com/community/tutorials/machine-learning-python#gs.Ae7Ua_Y).

## Now, back to the Titanic...
### Prepare the data for ML
Reminder:

In [None]:
train = pd.read_csv(train_url, header=0)
test = pd.read_csv(test_url, header=0)
train.head()

#### "Sex" to int (binary)
> Transform to **Int** and store in a new 'Gender' column

In [None]:
#Adding in new 'Gender' column to the dataframe
train['Gender'] = train['Sex'].map( {'female':0, 'male':1}).astype(int)
test['Gender'] = test['Sex'].map( {'female':0, 'male':1}).astype(int)

#### "Embarked" is multicategorical - Transform in two steps:
1. Transform to **categorical** and store in a new 'Embarkport' column
2. Transform to **Int** and store in a new 'EmbarkportInt' column

In [None]:
train['Embarkport'] = train['Embarked'].astype("category")
train.head()

In [None]:
train['EmbarkportInt'] = train['Embarkport'].cat.rename_categories(range(train['Embarkport'].nunique())).astype(int)
train.head()

In [None]:
test['Embarkport'] = test['Embarked'].astype("category")
test['EmbarkportInt'] = test['Embarkport'].cat.rename_categories(range(test['Embarkport'].nunique())).astype(int)

An alternative is to convert the categorical data to one-hot representation:

In [None]:
embarked_dummies = pd.get_dummies(train['Embarkport'])
train_with_dummies = pd.concat([train, embarked_dummies], axis=1)
train_with_dummies.head()

In [None]:
#Cleaning - drop the newly created columns
train = train.drop(['Embarkport','EmbarkportInt'], axis=1) 
test = test.drop(['Embarkport','EmbarkportInt'], axis=1) 

## Convert non-numeric data
Identify which of the original features are objects

In [None]:
ObjFeat=train.keys()[train.dtypes.map(lambda x: x=='object')]

In [None]:
# Transform the original features to categorical
# Create new 'int' features, resp.
for f in ObjFeat:
    train[f] = train[f].astype("category")
    train[f+"Int"] = train[f].cat.rename_categories(range(train[f].nunique())).astype(int)
    train.loc[train[f].isnull(), f+"Int"] = np.nan #fix NaN conversion

    # Let's create a cross-tabulation to look at this transformation
    # pd.crosstab(train[f+"Int"], train[f], rownames=[f+"Int"], colnames=[f])
        
    test[f] = test[f].astype("category")        
    if test[f].cat.categories.isin(train[f].cat.categories).all():
        test[f] = test[f].cat.rename_categories(train[f].cat.categories)
    else:    
        print("\n\nTrain and Test don't share the same set of categories in feature '", f, "'")
    test[f+"Int"] = test[f].cat.rename_categories(range(test[f].nunique())).astype(int)
    test.loc[test[f].isnull(), f+"Int"] = np.nan #fix NaN conversion
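
One way to make sure train and test receive the same integer for the same category is to take the category list from the training data and apply it to both sets. The sketch below shows this on two invented columns; it is an alternative to the loop above, not a description of it, and it relies on `Categorical.codes`, which encodes missing or unseen values as -1.

```python
import pandas as pd

train_col = pd.Series(['S', 'C', 'Q', None])
test_col = pd.Series(['Q', 'S', 'S'])

# Take the category list from the training data only
categories = sorted(train_col.dropna().unique())

# Apply the same categories, in the same order, to both sets
train_cat = pd.Categorical(train_col, categories=categories)
test_cat = pd.Categorical(test_col, categories=categories)

# Integer codes; missing or unseen values are encoded as -1
train_codes = pd.Series(train_cat.codes)
test_codes = pd.Series(test_cat.codes)

print(train_codes.tolist(), test_codes.tolist())
```

The -1 entries can then be mapped to NaN or handled as a separate "unknown" value, depending on how missing data is treated downstream.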

## Missing Values
Age has missing values.  
Create a new 'AgeFill' column, where each missing value is filled with the median age of passengers with the same gender and Pclass

In [None]:
median_ages = np.zeros((2,3))
for i in range(0, 2):
    for j in range(0, 3):
        median_ages[i,j] = train[(train['Gender'] == i) & 
                              (train['Pclass'] == j+1)]['Age'].dropna().median() 

median_ages

In [None]:
train['AgeFill'] = train['Age']
test['AgeFill'] = test['Age']

# Fill the missing values with medians
for i in range(0, 2):
    for j in range(0, 3):
        train.loc[ (train.Age.isnull()) & (train.Gender == i) & (train.Pclass == j+1),\
                'AgeFill'] = median_ages[i,j]
        test.loc[ (test.Age.isnull()) & (test.Gender == i) & (test.Pclass == j+1),\
                 'AgeFill'] = median_ages[i,j]

train[ train['Age'].isnull() ][['Gender','Pclass','Age','AgeFill']].head(10)
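
The nested loops can also be expressed with a grouped `transform`, which returns one median per row, aligned with the original frame. This is a sketch on made-up rows, offered as an alternative rather than as the notebook's approach:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Gender': [0, 0, 0, 1, 1, 1],
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age': [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# Median age of each (Gender, Pclass) group, repeated for every row of the group
group_median = toy.groupby(['Gender', 'Pclass'])['Age'].transform('median')

# Keep known ages and fill only the missing ones
toy['AgeFill'] = toy['Age'].fillna(group_median)

print(toy)
```

Note that this computes the medians from the frame it is applied to; for a test set you would still want to reuse the medians computed on the training data, as the loop version does.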

In [None]:
# Create a feature that records whether the Age was originally missing
train['AgeIsNull'] = pd.isnull(train.Age).astype(int)
test['AgeIsNull'] = pd.isnull(test.Age).astype(int)

train[['Gender','Pclass','Age','AgeFill','AgeIsNull']].head(10)

## Scaling

In [None]:
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
train['AgeFill0-1'] = min_max_scaler.fit_transform(train['AgeFill'].values.reshape(-1, 1))
test['AgeFill0-1'] = min_max_scaler.transform(test['AgeFill'].values.reshape(-1, 1))  # reuse the scaling fitted on train
train[['Gender','Pclass','Age','AgeFill','AgeFill0-1','AgeIsNull']].head()
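
A scaler learns its minimum and maximum from the data it is fitted on, so a common convention is to fit it on the training data only and then apply the same mapping to the test data. The sketch below uses invented ages to show the effect; test values outside the training range simply fall outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages, already shaped (n_samples, 1)
train_age = np.array([[0.0], [20.0], [40.0]])
test_age = np.array([[10.0], [50.0]])

# Learn min and max from the training data only
scaler = MinMaxScaler().fit(train_age)

# Apply the same mapping to both sets
train_scaled = scaler.transform(train_age)
test_scaled = scaler.transform(test_age)

print(train_scaled.ravel(), test_scaled.ravel())
```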

In [None]:
train.drop('AgeFill', axis=1);

## Feature Construction
Parch is the number of parents or children onboard, 
and SibSp is the number of siblings or spouses. 
> We can aggregate them together to form a new 'FamilySize' 

In [None]:
train['FamilySize'] = train.SibSp + train.Parch
test['FamilySize'] = test.SibSp + test.Parch

Pclass had a large effect on survival, and it's possible Age will too.
>A constructed feature is the product of Age and Pclass, which amplifies 
old age and 3rd class; both were less likely to survive

In [None]:
train['Age*Class'] = train.AgeFill * train.Pclass
test['Age*Class'] = test.AgeFill * test.Pclass

train['Age*Class'].hist()

## Final Preparations

In [None]:
# categorical columns
train.dtypes[ObjFeat]
test.dtypes[ObjFeat]

#### Drop the columns that will not be used in training

In [None]:
train = train.drop(ObjFeat, axis=1) 
test = test.drop(ObjFeat, axis=1) 

train.dtypes

#### Create training and testing sets that have no nulls (by dropping rows)

In [None]:
train_noNaN = train.dropna()
test_noNaN = test.dropna()

train_noNaN.info()

In [None]:
test_noNaN.info()

#### Convert to Numpy array

In [None]:
train_data_X = train.drop(['Survived'], axis=1).values
train_data_Y = train.Survived.values
test_data = test.values

train_data_X_noNaN = train_noNaN.drop(['Survived'], axis=1).values
train_data_Y_noNaN = train_noNaN.Survived.values
test_data_noNaN = test_noNaN.values

## Feature selection example

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif

univariate_filter = SelectKBest(mutual_info_classif, k=4).fit(train_data_X_noNaN, train_data_Y_noNaN)

univariate_filter.transform(train_data_X_noNaN)
univariate_filter.transform(test_data_noNaN);

## Training Random Forest Classifier

In [None]:
# Import the random forest package
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Split the labeled data into a training part and a held-out test part
X_train_noNaN, X_test_noNaN, y_train_noNaN, y_test_noNaN = train_test_split(train_data_X_noNaN,
                                                                            train_data_Y_noNaN)

# Create the random forest object which will include all the parameters
# for the fit
forest = RandomForestClassifier(n_estimators=3)

# Fit the training data to the Survived labels and create the decision trees
forest = forest.fit(X_train_noNaN, y_train_noNaN)

# output = forest.predict(test_data_noNaN)
y_pred_noNaN = forest.predict(X_test_noNaN)

print('accuracy:', metrics.accuracy_score(y_test_noNaN, y_pred_noNaN))
print('precision:', metrics.precision_score(y_test_noNaN, y_pred_noNaN))
print('recall:', metrics.recall_score(y_test_noNaN, y_pred_noNaN))
print('f1 score:', metrics.f1_score(y_test_noNaN, y_pred_noNaN))


## TODO: Implement the following

## SVM

In [None]:
from sklearn.svm import SVC

svm = SVC(gamma='auto')
svm = svm.fit(X_train_noNaN, y_train_noNaN)
y_pred_noNaN = svm.predict(X_test_noNaN)

print('accuracy:', metrics.accuracy_score(y_test_noNaN, y_pred_noNaN))
print('precision:', metrics.precision_score(y_test_noNaN, y_pred_noNaN))
print('recall:', metrics.recall_score(y_test_noNaN, y_pred_noNaN))
print('f1 score:', metrics.f1_score(y_test_noNaN, y_pred_noNaN))

### Create a K-Folds cross validation iterator

In [None]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)

print(kf)  

### Train/Test Random Forest and SVM using K-Fold cross validation

In [None]:
for k, (train_index, test_index) in enumerate(kf.split(train_data_X_noNaN)):
    X_train_noNaN, X_test_noNaN = train_data_X_noNaN[train_index], train_data_X_noNaN[test_index]
    y_train_noNaN, y_test_noNaN = train_data_Y_noNaN[train_index], train_data_Y_noNaN[test_index]

    forest = RandomForestClassifier(n_estimators=3)
    forest = forest.fit(X_train_noNaN, y_train_noNaN)
    y_pred_noNaN_RF = forest.predict(X_test_noNaN)

    svm = SVC(gamma='auto')
    svm = svm.fit(X_train_noNaN, y_train_noNaN)
    y_pred_noNaN_SVM = svm.predict(X_test_noNaN)

    # results
    print("[fold {0}] RF score: {1:.5}, SVM score: {2:.5}".
          format(k, metrics.accuracy_score(train_data_Y_noNaN[test_index], y_pred_noNaN_RF),
                 metrics.accuracy_score(train_data_Y_noNaN[test_index], y_pred_noNaN_SVM)))
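
For the common case where only the per-fold scores are needed, scikit-learn's `cross_val_score` runs the same split/fit/score loop in one call. The sketch below uses random synthetic data, and the parameter choices (`n_estimators=10`, shuffled folds, fixed seeds) are assumptions made for reproducibility, not the settings used above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: 100 samples, 4 features, label depends on the first two
rng = np.random.RandomState(0)
X = rng.randn(100, 4)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Shuffled 5-fold split with a fixed seed
kf = KFold(n_splits=5, shuffle=True, random_state=0)
forest = RandomForestClassifier(n_estimators=10, random_state=0)

# One accuracy value per fold
scores = cross_val_score(forest, X, y, cv=kf, scoring='accuracy')

print(scores.mean(), scores.std())
```

Reporting the mean together with the spread across folds gives a rough idea of how sensitive the score is to the particular split.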
