# Machine Learning using Python

Machine Learning is at the forefront of computing today. In fact, it is omnipresent in the computing world! Even if you've never heard of Machine Learning before, you probably use it many times a day without realizing it. When you make a Google search, Machine Learning runs in the background to rank the results so that you see the most relevant ones first. When you log in to your email account, Machine Learning determines which emails appear in your inbox and which ones go to your spam folder.

Did you know? Machine Learning, or ML for short, is being used today to build driverless cars, predict emergency room waiting times, identify whales in oceans based on audio recordings (so that ships can avoid hitting them), and recommend which movie to watch next on services such as Netflix.

### Machine Learning libraries in Python
* Tensorflow
* scikit-learn
* Theano
* Pylearn2
* Pyevolve
* NuPIC
* Pattern
* Caffe

Other libraries
* Nilearn
* Statsmodels
* PyBrain (inactive)
* Fuel
* Bob
* skdata
* MILK
* IEPY
* Quepy
* Hebel
* mlxtend
* nolearn
* Ramp
* Feature Forge
* REP
* Python-ELM
* PythonXY
* XCS
* PyML
* MLPY (inactive)
* Orange
* Monte
* PYMVPA
* MDP (inactive)
* Shogun
* PyMC
* Gensim
* Neurolab
* FFnet (inactive)
* LibSVM
* Spearmint
* Chainer
* topik
* Crab
* CoverTree
* breze
* deap
* yahmm
* pydeep
* Annoy
* neon
* sentiment

![caption](Steps-to-Predictive-Modelling.jpg)

Image source: https://upxacademy.com/introduction-machine-learning/

In [None]:
import numpy as np
import pandas as pd
from patsy import dmatrices
import statsmodels.api as sm
from sklearn import svm
import warnings
import matplotlib.pyplot as plt

# To display plots inside notebook
%matplotlib inline

warnings.filterwarnings('ignore')

# notebook parameters
pd.set_option('display.max_rows', 15)

# GET DATA

### Data Handling
#### Let's read our data in using pandas:

In [None]:
df = pd.read_csv(r"data/train.csv")

In [None]:
df

In [None]:
df.describe()

#### To view the columns individually

In [None]:
df['Name']

#### To find the number of occurrences of each value

In [None]:
df['Pclass'].value_counts()

In [None]:
df['Sex'].value_counts()
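As a rough illustration of what `value_counts()` returns, here is a small sketch on made-up data (the `toy` Series is hypothetical and not taken from `train.csv`). The result is ordered by frequency by default; `sort_index()` reorders it by label instead.

```python
import pandas as pd

# Hypothetical toy data, not taken from train.csv
toy = pd.Series(['male', 'female', 'male', 'male', 'female'], name='Sex')

# Counts per distinct value, most frequent value first
counts = toy.value_counts()

# The same counts, ordered by label instead of by frequency
by_label = counts.sort_index()

print(counts)
print(by_label)
```

The ordering matters later on, when the counts are plotted and the bars need to line up with their labels.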

# CLEAN, PREPARE, AND MANIPULATE DATA

In [None]:
df.apply(lambda x: sum(x.isnull()), axis=0)

In [None]:
df.apply(lambda x: sum(x.notnull()), axis=0)
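The two cells above count missing and non-missing values per column. A shorter way to get what should be the same numbers is sketched below on a made-up frame (`toy` is hypothetical), using the vectorized `isnull().sum()` and `notnull().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing values
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0],
    'Cabin': [np.nan, np.nan, 'C85'],
})

# Number of missing values in each column
missing = toy.isnull().sum()

# Number of non-missing values in each column
present = toy.notnull().sum()

# The apply/lambda form used above, for comparison
same = toy.apply(lambda col: sum(col.isnull()), axis=0)

print(missing)
print(present)
```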

#### To drop the column that is not required

In [None]:
df.drop(['Cabin'], axis=1)

In [None]:
df = df.drop(['Cabin'], axis=1)
# df.drop(['Cabin'], axis=1, inplace=True)

In [None]:
# Remove NaN values
df = df.dropna()
# df.dropna(inplace=True)
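One detail worth seeing in isolation: `drop` and `dropna` return new DataFrames and leave the original untouched unless the result is assigned back (or `inplace=True` is used). The sketch below uses a made-up frame (`toy` is hypothetical) to show this.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not taken from train.csv
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0],
    'Cabin': [np.nan, 'C85', np.nan],
    'Fare': [7.25, 71.28, 8.05],
})

# drop returns a new frame; toy itself still has the 'Cabin' column
dropped = toy.drop(['Cabin'], axis=1)

# dropna removes every row that still contains a NaN
clean = dropped.dropna()

print(toy.columns.tolist())
print(dropped.columns.tolist())
print(len(clean))
```

Dropping the mostly empty `Cabin` column first means `dropna` throws away far fewer rows than it would otherwise.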

In [None]:
df.head()

#### Now let's check for the 'notnull' values

In [None]:
df.apply(lambda x: sum(x.notnull()), axis=0)

### Visualize our data graphically:

#### Plot a bar graph of those who survived vs those who did not

In [None]:
df['Survived'].value_counts().plot(kind='bar')
plt.title("Distribution of Survival, (1 = Survived)")
plt.grid(True)

In [None]:
plt.scatter(df['Survived'], df['Age'])
plt.title("Survival by Age, (1 = Survived)")
plt.grid(True, axis='y')

In [None]:
df['Pclass'].value_counts().plot(kind="barh")
plt.grid(True)

#### Checking for 'class 1' passengers

In [None]:
df['Pclass'] == 1

#### Using the 'class 1' mask to index 'Age' --> to find the ages of 'class 1' passengers

In [None]:
df['Age'][df['Pclass'] == 1]

In [None]:
len(df['Age'][df['Pclass'] == 1])
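The comparison `df['Pclass'] == 1` produces a boolean mask, and indexing with that mask keeps only the rows where it is `True`. Here is a minimal sketch on made-up data (`toy` is hypothetical); it also shows `.loc`, which selects the rows and the column in a single step.

```python
import pandas as pd

# Hypothetical toy data, not taken from train.csv
toy = pd.DataFrame({
    'Pclass': [1, 3, 1, 2],
    'Age': [38.0, 22.0, 54.0, 27.0],
})

# Boolean mask: True for first-class passengers
mask = toy['Pclass'] == 1

# Chained form, as used in the cells above
ages_chained = toy['Age'][mask]

# Equivalent selection with .loc (rows by mask, column by name)
ages_loc = toy.loc[mask, 'Age']

print(ages_loc)
print(mask.sum())  # number of first-class passengers
```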

In [None]:
df['Age'][df['Pclass'] == 1].plot(kind='kde')    
df['Age'][df['Pclass'] == 2].plot(kind='kde')
df['Age'][df['Pclass'] == 3].plot(kind='kde')
plt.xlabel("Age")    
plt.title("Age Distribution within classes")
# sets our legend for our graph.
plt.legend(('1st Class', '2nd Class', '3rd Class'), loc='best')
plt.grid(True)

### Exploratory Visualization:
The point of this tutorial is to predict if an individual will survive based on the features in the data like:

* Traveling Class (called Pclass in the data)
* Sex
* Age
* Fare Price

In [None]:
df['Survived'].value_counts().plot(kind='barh', color="orange") 
plt.title("Survival Breakdown (1 = Survived, 0 = Died)")
plt.grid(True)

#### Count how many male and female passengers survived and how many died, sorted by outcome (0 = Died, 1 = Survived)

In [None]:
df['Survived'][df['Sex'] == 'male'].value_counts()

In [None]:
df['Survived'][df['Sex'] == 'female'].value_counts()

In [None]:
df_male = df['Survived'][df['Sex'] == 'male'].value_counts().sort_index()
df_female = df['Survived'][df['Sex'] == 'female'].value_counts().sort_index()

In [None]:
df_male

In [None]:
df_female

In [None]:
df_male.plot(kind='barh', color='blue', label='Male', alpha=0.55)
df_female.plot(kind='barh', color='pink', label='Female', alpha=0.55)
plt.grid(True)
plt.legend(loc='best')
plt.title("Who Survived? with respect to Gender, (raw value counts)")

#### Now let's find out the proportion of passengers who survived

In [None]:
df_male.sum()

In [None]:
df_female.sum()

In [None]:
df_male/float(df_male.sum())

In [None]:
df_female/float(df_female.sum())
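Dividing the counts by their total gives proportions. pandas can also do this directly with `value_counts(normalize=True)`; the sketch below, on made-up data (`toy` is hypothetical), checks that both routes agree.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from train.csv
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'female', 'female'],
    'Survived': [0, 0, 1, 1, 1],
})

male = toy.loc[toy['Sex'] == 'male', 'Survived']

# Manual route: counts divided by the group size
manual = male.value_counts().sort_index() / float(len(male))

# Built-in route: normalize=True returns fractions that sum to 1
builtin = male.value_counts(normalize=True).sort_index()

print(manual)
print(np.allclose(manual.values, builtin.values))
```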

In [None]:
(df_male/float(df_male.sum())).plot(kind='barh', label='Male', alpha=0.55)  
(df_female/float(df_female.sum())).plot(kind='barh', color='#FA2379', label='Female', alpha=0.55)
plt.title("Who Survived proportionally? with respect to Gender")
plt.grid(True)
plt.legend(loc='best')

#### Let's dig deeper by breaking survival down by passenger class

In [None]:
female_highclass = df['Survived'][(df['Pclass'] != 3) & (df['Sex'] == 'female')].value_counts()
female_lowclass = df['Survived'][(df['Pclass'] == 3) & (df['Sex'] == 'female')].value_counts()
male_highclass = df['Survived'][(df['Pclass'] != 3) & (df['Sex'] == 'male')].value_counts()
male_lowclass = df['Survived'][(df['Pclass'] == 3) & (df['Sex'] == 'male')].value_counts()

In [None]:
female_highclass

In [None]:
female_lowclass

In [None]:
male_highclass

In [None]:
male_lowclass
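The four Series above slice the data by hand. As an alternative sketch (not the approach used in this notebook), `pd.crosstab` and `groupby` can build the same kind of breakdown in one table. The frame `toy` and the helper column `ClassGroup` are hypothetical names introduced only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from train.csv
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'male', 'male', 'male'],
    'Pclass': [1, 3, 3, 1, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0],
})

# Same grouping as above: class 3 is 'low', classes 1 and 2 are 'high'
toy['ClassGroup'] = np.where(toy['Pclass'] == 3, 'low', 'high')

# Counts of died (0) and survived (1) for each sex/class group
table = pd.crosstab([toy['Sex'], toy['ClassGroup']], toy['Survived'])

# Survival rate per group: the mean of a 0/1 column is the fraction of 1s
rates = toy.groupby(['Sex', 'ClassGroup'])['Survived'].mean()

print(table)
print(rates)
```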

In [None]:
# figure parameters: 18x4 inches at 100 dpi
fig = plt.figure(figsize=(18,4), dpi=100)

# Making subplots
# # equivalent but more general
# fig.add_subplot(1,1,1)
    
# # add subplot with red background
# fig.add_subplot(212, facecolor='r')

# 141 represents the following:
# 1 --> no of rows
# 4 --> no of columns
# 1 --> plot number
ax1 = fig.add_subplot(141)
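# sort_index() puts 0 (Died) before 1 (Survived), so the tick labels always match the bars
female_highclass.sort_index().plot(kind='bar', label='female, highclass', color='#FA2479', alpha=0.55)
ax1.set_xticklabels(["Died", "Survived"], rotation=0)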
plt.title("Who Survived? with respect to Gender and Class")
plt.legend(loc='best')
plt.grid(True)

# 'sharey' --> shares the y axis with the mentioned axis
# now ax2 shares the y axis of ax1
ax2 = fig.add_subplot(142, sharey=ax1)
female_lowclass.sort_index().plot(kind='bar', label='female, low class', color='pink', alpha=0.55)
ax2.set_xticklabels(["Died", "Survived"], rotation=0)
plt.legend(loc='best')
plt.grid(True)

ax3 = fig.add_subplot(143, sharey=ax1)
male_lowclass.sort_index().plot(kind='bar', label='male, low class', color='lightblue', alpha=0.55)
ax3.set_xticklabels(["Died", "Survived"], rotation=0)
plt.legend(loc='best')
plt.grid(True)

ax4 = fig.add_subplot(144, sharey=ax1)
male_highclass.sort_index().plot(kind='bar', label='male, highclass', color='steelblue', alpha=0.55)
ax4.set_xticklabels(["Died", "Survived"], rotation=0)
plt.legend(loc='best')
plt.grid(True)

In [None]:
df['Survived'][df.Sex == 'male'].value_counts().sort_index()

In [None]:
df['Survived'][df.Sex == 'female'].value_counts()

In [None]:
fig = plt.figure(figsize=(18,4), dpi=100)

ax1 = fig.add_subplot(121)
df['Survived'][df.Sex == 'female'].value_counts().sort_index().plot(kind='bar', color='#FA2379', label='Female', alpha=0.55)
df['Survived'][df.Sex == 'male'].value_counts().sort_index().plot(kind='bar', label='Male', alpha=0.55)
plt.title("Who Survived? with respect to Gender.")
plt.legend(loc='best')
plt.grid(True)

ax2 = fig.add_subplot(122)
# sort_index() keeps both series in the same order (0 = Died, 1 = Survived) so the bars line up
(df['Survived'][df['Sex'] == 'male'].value_counts().sort_index()/float(df['Sex'][df['Sex'] == 'male'].size)).plot(kind='bar', label='Male', alpha=0.55)
(df['Survived'][df['Sex'] == 'female'].value_counts().sort_index()/float(df['Sex'][df['Sex'] == 'female'].size)).plot(kind='bar', color='#FA2379', label='Female', alpha=0.55)
plt.title("Who Survived proportionally?")
plt.legend(loc='best')
plt.grid(True)

# TRAIN MODEL

#### Let's create a formula for our model

In [None]:
# Ref: http://patsy.readthedocs.org/en/latest/formulas.html
formula = 'Survived ~ C(Pclass) + C(Sex) + Age + SibSp  + C(Embarked)'

#### 'dmatrices' is used to create a regression-friendly dataframe

In [None]:
y, X = dmatrices(formula, data=df, return_type='dataframe')
# instantiate our model
model = sm.Logit(y, X)

# fit our model to the training data
res = model.fit()

# save the result for outputting predictions later
result = [res, formula]
res.summary()
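In the formula, `C(...)` marks a column as categorical. With patsy's default treatment coding, each such column is expanded into 0/1 indicator columns with one reference level left out, and an intercept column is added. The sketch below imitates that layout with plain pandas on made-up data (`toy` is hypothetical). It is only an approximation of what `dmatrices` builds; the column names and column order differ from patsy's.

```python
import pandas as pd

# Hypothetical toy data, not taken from train.csv
toy = pd.DataFrame({
    'Pclass': [1, 2, 3, 3],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0],
})

# One indicator column per category, dropping the first level as the reference
encoded = pd.get_dummies(toy, columns=['Pclass', 'Sex'], drop_first=True, dtype=float)

# Add a constant column, playing the role of the model intercept
encoded.insert(0, 'Intercept', 1.0)

print(encoded)
```

Leaving out one level per categorical variable avoids perfectly collinear columns once an intercept is present, which a logistic regression fit would otherwise struggle with.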

#### Now let's train a machine learning model: a support vector machine (SVM)

In [None]:
# Create our machine learning formula
formula_ml = 'Survived ~ C(Pclass) + C(Sex) + Age + SibSp + Parch + C(Embarked)'

In [None]:
# set plotting parameters
plt.figure(figsize=(8,6))

# create a regression friendly data frame
y, x = dmatrices(formula_ml, data=df, return_type='matrix')

# select which features we would like to analyze
# try changing the selection here for different output.
# Choose : [2,3] - pretty sweet decision boundaries (DBs), [3,1] - standard DBs, [7,3] - very cool DBs,
# [3,6] - very long, complex DBs, could take over an hour to calculate!
feature_1 = 2
feature_2 = 3

X = np.asarray(x)
X = X[:,[feature_1, feature_2]]  


y = np.asarray(y)
# needs to be 1-dimensional, so we flatten; it comes out of dmatrices as a single-column 2-D matrix.
y = y.flatten()      

n_sample = len(X)

# will give a shuffled set of unique random integers of given range
np.random.seed(0)
order = np.random.permutation(n_sample)

X = X[order]
y = y[order].astype(float)

# hold out the last 10% of the shuffled samples as a test set (a simple train/test split)
ninety_percent_of_sample = int(.9 * n_sample)
X_train = X[:ninety_percent_of_sample]
y_train = y[:ninety_percent_of_sample]
X_test = X[ninety_percent_of_sample:]
y_test = y[ninety_percent_of_sample:]

# create a list of the types of kernels we will use for our analysis
types_of_kernels = ['linear', 'rbf', 'poly']

# specify our color map for plotting the results
color_map = plt.cm.Paired
# color_map = plt.cm.coolwarm

# fit the model
for fig_num, kernel in enumerate(types_of_kernels):
    clf = svm.SVC(kernel=kernel, gamma=3)
    clf.fit(X_train, y_train)

    plt.figure(fig_num)
    plt.scatter(X[:, 0], X[:, 1], c=y, zorder=10, cmap=color_map)

    # circle out the test data
    plt.scatter(X_test[:, 0], X_test[:, 1], s=80, facecolors='none', zorder=10)
    
    plt.axis('tight')
    x_min = X[:, 0].min()
    x_max = X[:, 0].max()
    y_min = X[:, 1].min()
    y_max = X[:, 1].max()

    XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
    Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])

    # put the result into a color plot
    Z = Z.reshape(XX.shape)
    plt.pcolormesh(XX, YY, Z > 0, cmap=color_map)
    plt.contour(XX, YY, Z, colors=['k', 'k', 'k'], linestyles=['--', '-', '--'],
               levels=[-.5, 0, .5])

    plt.title(kernel)
    plt.show()
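The cell above shuffles the rows by hand and slices off the last 10% as a test set. scikit-learn's `train_test_split` does the same kind of shuffled hold-out split in one call. The sketch below uses randomly generated stand-in data (`X_toy` and `y_toy` are hypothetical), so the exact rows that land in each set will not match the manual split above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples with 2 features each
rng = np.random.RandomState(0)
X_toy = rng.rand(100, 2)
y_toy = (X_toy[:, 0] > 0.5).astype(float)

# Shuffle and hold out 10% of the samples for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=0
)

print(X_tr.shape, X_te.shape)
```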

In [None]:
clf = svm.SVC(kernel='poly', gamma=3).fit(X_train, y_train)

# TEST MODEL

In [None]:
clf.score(X_test, y_test)
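For a classifier, `score` returns the mean accuracy: the fraction of test samples whose predicted label equals the true label. The sketch below checks this on randomly generated stand-in data (`X_toy`, `y_toy` and `clf_toy` are hypothetical names, not the notebook's variables).

```python
import numpy as np
from sklearn import svm

# Hypothetical stand-in data: 60 samples with 2 features each
rng = np.random.RandomState(0)
X_toy = rng.rand(60, 2)
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 1.0).astype(float)

# Train on the first 50 samples, test on the remaining 10
clf_toy = svm.SVC(kernel='linear').fit(X_toy[:50], y_toy[:50])

# Accuracy as reported by score()
score = clf_toy.score(X_toy[50:], y_toy[50:])

# The same number computed by hand
manual = np.mean(clf_toy.predict(X_toy[50:]) == y_toy[50:])

print(score, manual)
```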

### Predict results

In [None]:
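# predict expects a 2-D array of shape (n_samples, n_features), so wrap the single sample in a list
clf.predict(np.array([[0, 1]]))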
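scikit-learn estimators expect input of shape `(n_samples, n_features)`, even when predicting for a single sample. The sketch below, on a tiny made-up dataset (`X_toy`, `y_toy` and `clf_toy` are hypothetical), shows two ways to pass one sample and how to predict for several samples at once.

```python
import numpy as np
from sklearn import svm

# Hypothetical toy data: the label simply follows the first feature
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_toy = np.array([0.0, 0.0, 1.0, 1.0])

clf_toy = svm.SVC(kernel='linear').fit(X_toy, y_toy)

# A single sample stored as a 1-D array must be reshaped to one row
single = np.array([0, 1])
pred_one = clf_toy.predict(single.reshape(1, -1))

# Several samples can be predicted in one call
pred_many = clf_toy.predict(np.array([[0, 1], [1, 0]]))

print(pred_one, pred_many)
```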