# Seminar Python 4: Using the scikit-learn package

To install the package (so that it can be imported), run the pip command in the Windows Command Prompt or in a UNIX terminal/shell:

`pip install scikit-learn`

The same applies to other Python packages: scipy, six, cycler, pyparsing, kiwisolver, python-dateutil, matplotlib, pytz, pandas, seaborn, numpy, statsmodels etc. Note that the module imported as `sklearn` is installed under the name `scikit-learn`.  
It may be necessary to upgrade pip (the Python package installer); see details here: https://datatofish.com/upgrade-pip/

In [None]:
#Check if package/module is available (handle import exceptions using try...except)
try:                 
    import sklearn
    print('Import OK.')
except ImportError as err: 
    print(err)

In [None]:
#Running pip in Jupyter Lab, using the %pip IPython "magic command"
%pip install scikit-learn

## K-Means clustering with Python and scikit-learn
The K-Means method groups the observations of a dataset into K (= 2, 3, ...) clusters, i.e. groups of related / similar observations.
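
To make the grouping idea concrete, here is a minimal NumPy sketch of the usual K-Means iteration: assign each point to the nearest center, then move each center to the mean of its points. The function name `kmeans_sketch`, the toy data and the fixed number of iterations are assumptions made for illustration; scikit-learn's `KMeans` is more elaborate (smarter initialization, convergence test, several restarts).

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=10, seed=0):
    """Toy K-Means: alternate between assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # Each point joins the cluster of its nearest center
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers, labels

# Hypothetical toy data: two visibly separated groups
demo = np.array([[1, 1], [1, 2], [2, 1], [10, 10], [10, 11], [11, 10]])
centers, labels = kmeans_sketch(demo, k=2)
print(centers)
print(labels)
```

The cluster numbers (0 or 1) carry no meaning of their own; only the grouping matters.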

In [None]:
#Example 1. Grouping a dataset in 3 clusters
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[5,3],
     [10,15],
     [15,12],
     [24,10],
     [30,45],
     [85,70],
     [71,80],
     [60,78],
     [55,52],
     [80,91]])
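#Fit K-Means with 3 clusters (n_init is set explicitly, its default differs between versions)
kmeans = KMeans(n_clusters=3, n_init=10)
kmeans.fit(X)
print(kmeans.cluster_centers_)  #coordinates of the 3 centroids
print(kmeans.labels_)           #cluster index assigned to each observation
#Figure 1: the raw observations
f1 = plt.figure()
plt.scatter(X[:,0], X[:,1], label='True Position')
#Figure 2: observations colored by cluster
f2 = plt.figure()
plt.scatter(X[:,0], X[:,1], c=kmeans.labels_, cmap='rainbow')
#Figure 3: the cluster centroids
f3 = plt.figure()
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], color='black')
plt.show()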

K-Means is an unsupervised machine learning technique: it finds clusters of items in a training set without using any class labels. Based on these clusters, we can then predict which cluster other items (in the test set) are likely to belong to. See more details [here](https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1).
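
As a small illustration of predicting on items outside the training set, the sketch below fits `KMeans` on six points and then calls `predict` on two new ones. The arrays `train_points` and `new_points` are made-up values, and `n_init` and `random_state` are set only to keep the run reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical training observations: one group near the origin, one far from it
train_points = np.array([[5, 3], [10, 15], [15, 12], [60, 78], [71, 80], [85, 70]])
# Hypothetical unseen observations
new_points = np.array([[8, 9], [75, 75]])

model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(train_points)

train_labels = model.labels_            # cluster of each training point
new_labels = model.predict(new_points)  # nearest-centroid cluster of each new point
print(train_labels)
print(new_labels)
```

`predict` simply assigns each new point to the cluster with the closest centroid.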

**Application example**  
The sinking of the Titanic in 1912 resulted in 1505 casualties out of a total of 2224 people on board (passengers and crew members).  
We use the datasets `train.csv` and `test.csv`, which contain passenger data. The training dataset includes the `Survived` column.  
We may consider the hypothesis that survival was influenced by attributes such as age, sex, ticket (passenger) class etc. We use K-Means to group the passengers in the training set into two clusters, ideally survivors and victims. Then we predict whether other passengers were likely part of one class or the other.

In [None]:
#Example 2. Step 1. Import libraries
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
#Example 2. Step 2. Read data from files, print first 5 records
pd.options.display.max_columns = 12
test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')

print('*****test*****')
print(test.head())
print('*****train*****')
print(train.head())

In [None]:
#Example 2. Step 3. Compute statistics
print('*****test_stats*****')
print(test.describe())
print('*****train_stats*****')
print(train.describe())

Certain algorithms, K-Means included, do not accept missing values. Therefore, these must be handled (removed or replaced) before fitting a model.
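
One common way of handling missing values is to replace them with the column mean. The sketch below shows the idea on a tiny made-up DataFrame (`toy` and its values are assumptions, not the Titanic data); `numeric_only=True` makes pandas skip text columns such as `Sex` when computing the means.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Age and one missing Fare
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 30.0],
    'Fare': [7.25, 8.05, np.nan],
    'Sex': ['male', 'female', 'male'],
})

missing_before = toy.isna().sum()                  # missing values per column
filled = toy.fillna(toy.mean(numeric_only=True))   # NaN -> mean of that column
print(missing_before)
print(filled)
```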

In [None]:
#Example 2. Step 4. View columns in the train set, check for missing values
print(train.columns.values)

print('*****train missing values *****')
print(train.isna())
print('*****test missing values*****')
print(test.isna())

In [None]:
#Example 2. Step 5. Calculate no. of missing values
print("*****In the train set*****")
print(train.isna().sum())
print("\n")
print('*****In the test set*****')
print(test.isna().sum())

In [None]:
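#Example 2. Step 6. Replace missing values with column average, using fillna()
#numeric_only=True skips text columns (required by recent pandas versions)
train.fillna(train.mean(numeric_only=True), inplace=True)
test.fillna(test.mean(numeric_only=True), inplace=True)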
print(train.isna().sum())
print(test.isna().sum())

In [None]:
#Example 2. Step 7. Evaluate survival depending on Pclass, Sex, SibSp
print(train[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False))
print(train[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False))
print(train[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False))

In [None]:
#Example 2. Step 8. Chart for analyzing Age-Survived and Pclass-Survived
g = sns.FacetGrid(train, col='Survived')
g.map(plt.hist, 'Age')
grid = sns.FacetGrid(train, col='Survived', row='Pclass')
grid.map(plt.hist, 'Age')
grid.add_legend()
plt.show()

In [None]:
#Example 2. Step 9. Show info about the training set
train.info()

In [None]:
#Example 2. Step 10. Remove non-numeric columns with no relevance for survival
train = train.drop(['Name','Ticket', 'Cabin','Embarked'], axis=1)
test = test.drop(['Name','Ticket', 'Cabin','Embarked'], axis=1)

In [None]:
#Example 2. Step 11. Transform data type for column Sex (text -> integer codes)
labelEncoder = LabelEncoder()
#Fit the encoder once, on the train set, then apply the same mapping to both sets
labelEncoder.fit(train['Sex'])
train['Sex'] = labelEncoder.transform(train['Sex'])
test['Sex'] = labelEncoder.transform(test['Sex'])
train.info()
test.info()

In [None]:
#Example 2. Step 12. We use X as an array (numpy-array) based on the train set
#without the Survived column; y is a vector based on the Survived column
X = np.array(train.drop(['Survived'], axis=1).astype(float))
y = np.array(train['Survived'])

In [None]:
#Example 2. Step 13. Create a KMeans model with 2 clusters
#(survivors / casualties). 

kmeans = KMeans(n_clusters=2) 
#kmeans = KMeans(n_clusters=2, max_iter=600)
kmeans.fit(X)

#Other arguments for KMeans (defaults may differ between scikit-learn versions):
'''KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=300,
    tol=0.0001, random_state=None, copy_x=True, verbose=0)'''

In [None]:
#Example 2. Step 14. Evaluate results
correct = 0
for i in range(len(X)):
    predict_me = np.array(X[i].astype(float))
    predict_me = predict_me.reshape(-1, len(predict_me))
    prediction = kmeans.predict(predict_me)
    if prediction[0] == y[i]:
        correct += 1

print(correct/len(X))

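The model has an accuracy of about 50%. Note that K-Means numbers its clusters arbitrarily, so cluster 0 does not necessarily correspond to `Survived = 0`; a score of p is therefore equivalent to 1 - p.  
It may be improved by scaling the input data.  
Example 2, step 15: open and run `Ex_2_15.ipynb`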
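
The sketch below illustrates why scaling can matter for K-Means, which relies on distances. It uses synthetic data, not the Titanic set: one informative feature on a scale of about 0 to 1, and one uninformative feature on a scale of hundreds. The helper `cluster_accuracy` is a hypothetical function introduced here; it scores the clustering in a way that does not depend on which cluster happens to be numbered 0.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

def cluster_accuracy(y_true, labels):
    """Agreement between 0/1 classes and 2 clusters, ignoring cluster numbering."""
    acc = np.mean(y_true == labels)
    return max(acc, 1 - acc)

rng = np.random.default_rng(0)
y_toy = np.repeat([0, 1], 50)                       # two classes of 50 items each
informative = y_toy + rng.normal(0, 0.1, size=100)  # separates the classes, small scale
noise = rng.normal(250, 100, size=100)              # unrelated to the classes, large scale
X_toy = np.column_stack([informative, noise])

# Without scaling, distances are dominated by the large-scale noise column
raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

# MinMaxScaler maps every column to the interval [0, 1]
X_toy_scaled = MinMaxScaler().fit_transform(X_toy)
scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy_scaled)

acc_raw = cluster_accuracy(y_toy, raw.labels_)
acc_scaled = cluster_accuracy(y_toy, scaled.labels_)
print(acc_raw, acc_scaled)
```

On real data the gain depends on how informative each feature is, so scaling is something to try rather than a guaranteed improvement.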

In [None]:
# --- DO NOT RUN THIS CELL --- #
#Example 2. Complete code for KMeans example - see also ex_2.py,
#-can be run as stand-alone program (in a terminal / IDLE / PyCharm)

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt
pd.options.display.max_columns = 12
test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')

print('*****test*****')
print(test.head())
print('*****train*****')
print(train.head())

print('*****test_stats*****')
print(test.describe())
print('*****train_stats*****')
print(train.describe())

print(train.columns.values)

print(train.isna())
print(test.isna())

print('*****In the train set*****')
print(train.isna().sum())
print("\n")
print('*****In the test set*****')
print(test.isna().sum())

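#numeric_only=True skips text columns (required by recent pandas versions)
train.fillna(train.mean(numeric_only=True), inplace=True)
test.fillna(test.mean(numeric_only=True), inplace=True)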
print(train.isna().sum())
print(test.isna().sum())


print(train[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False))
print(train[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False))
print(train[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False))

g = sns.FacetGrid(train, col='Survived')
g.map(plt.hist, 'Age')
grid = sns.FacetGrid(train, col='Survived', row='Pclass')
grid.map(plt.hist, 'Age')
grid.add_legend()
plt.show()

train.info()

train = train.drop(['Name','Ticket', 'Cabin','Embarked'], axis=1)
test = test.drop(['Name','Ticket', 'Cabin','Embarked'], axis=1)

labelEncoder = LabelEncoder()
#Fit the encoder once, on the train set, then apply the same mapping to both sets
labelEncoder.fit(train['Sex'])
train['Sex'] = labelEncoder.transform(train['Sex'])
test['Sex'] = labelEncoder.transform(test['Sex'])

train.info()

test.info()

X = np.array(train.drop(['Survived'], axis=1).astype(float))

y = np.array(train['Survived'])

#Scale every column to the interval [0, 1] before clustering
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=2)
#kmeans = KMeans(n_clusters=2, max_iter=600)
kmeans.fit(X_scaled)
correct = 0
for i in range(len(X_scaled)):
    #predict on the scaled row, since the model was fitted on scaled data
    predict_me = np.array(X_scaled[i].astype(float))
    predict_me = predict_me.reshape(-1, len(predict_me))
    prediction = kmeans.predict(predict_me)
    if prediction[0] == y[i]:
        correct += 1

print(correct/len(X_scaled))

## Logistic regression
This machine learning technique is used for classification problems: predicting whether or not an item is likely to belong to a class (binary logistic regression), e.g. whether a passenger is a survivor or not.  
It works by fitting a regression model based on the sigmoid (logistic) function, which maps any real value to a probability between 0 and 1, instead of a straight line as in linear regression. For details, see [this page](https://towardsdatascience.com/introduction-to-logistic-regression-66248243c148).  
The regression model is determined using the training dataset. Then we can attempt predictions for the items in the test set.
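
The sigmoid function itself is short enough to write by hand. The sketch below applies it to a linear combination `intercept + coef * x` and turns the resulting probabilities into classes with a 0.5 threshold. The coefficient values are hypothetical, chosen only for illustration; in practice they are estimated from the training data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical model coefficients (not fitted on any data)
intercept, coef = -1.0, 2.0
x = np.array([-2.0, 0.0, 0.5, 3.0])

# Probability of belonging to class 1 for each value of x
probabilities = sigmoid(intercept + coef * x)
# Class 1 if the probability is at least 0.5, class 0 otherwise
predicted_class = (probabilities >= 0.5).astype(int)
print(probabilities)
print(predicted_class)
```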

In [None]:
#Example 3. logistic regression
import pandas as pd
from sklearn.linear_model import LogisticRegression
pd.options.display.max_columns = 12
test = pd.read_csv('test1.csv')
train = pd.read_csv('train.csv')

print('*****test*****')
print(test[:4])
print('*****train*****')
print(train[:4])

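#numeric_only=True skips text columns (required by recent pandas versions)
train.fillna(train.mean(numeric_only=True), inplace=True)
test.fillna(test.mean(numeric_only=True), inplace=True)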
print(train.isna().sum())
print(test.isna().sum())

train['IsFemale'] = (train['Sex'] == 'female').astype(int)
test['IsFemale'] = (test['Sex'] == 'female').astype(int)

predictors = ['Pclass', 'IsFemale', 'Age']
X_train = train[predictors].values
X_test = test[predictors].values
y_train = train['Survived'].values
y_test = test['Survived'].values
print(X_train[:5])
print(y_train[:5])

#Arguments left out keep their scikit-learn defaults (e.g. L2 regularization);
#several arguments listed in older tutorials are deprecated in recent versions
model = LogisticRegression(C=1.0, solver='lbfgs', max_iter=100, tol=0.0001)

model.fit(X_train, y_train)
y_predict = model.predict(X_test)
print(y_predict)

print((y_test == y_predict).mean())

## Simple linear regression

In [None]:
#Example 4. Simple linear regression (OLS - Ordinary Least Squares)
import pandas as pd
import statsmodels.api as sm
pd.options.display.max_columns = 12
test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')

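#numeric_only=True skips text columns (required by recent pandas versions)
train.fillna(train.mean(numeric_only=True), inplace=True)
test.fillna(test.mean(numeric_only=True), inplace=True)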

train['IsFemale'] = (train['Sex'] == 'female').astype(int)
test['IsFemale'] = (test['Sex'] == 'female').astype(int)

X_train = train['IsFemale'].values
X_train = sm.add_constant(X_train)

y_train = train['Survived'].values

model = sm.OLS(y_train, X_train)

results = model.fit()
print(results.params)
print(results.summary())
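
With a single 0/1 predictor such as `IsFemale`, the OLS coefficients have a direct reading: the intercept is the mean outcome of the group coded 0, and the slope is the difference between the two group means. The sketch below checks this with `np.linalg.lstsq` on a small made-up sample (the arrays `is_female` and `survived` are assumptions, not the Titanic data).

```python
import numpy as np

# Hypothetical sample: 4 men (0) and 4 women (1)
is_female = np.array([0, 0, 0, 1, 1, 1, 1, 0])
survived = np.array([0, 1, 0, 1, 1, 0, 1, 0])

# Design matrix with a constant column, the same role as sm.add_constant
design = np.column_stack([np.ones(len(is_female)), is_female])

# Least-squares solution of design @ params = survived
params, *_ = np.linalg.lstsq(design, survived, rcond=None)
intercept, slope = params

print(intercept, survived[is_female == 0].mean())
print(slope, survived[is_female == 1].mean() - survived[is_female == 0].mean())
```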

## Multiple linear regression

In [None]:
#Example 5. Multiple linear regression
import pandas as pd
import statsmodels.formula.api as smf

pd.options.display.max_columns = 12
test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')


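#numeric_only=True skips text columns (required by recent pandas versions)
train.fillna(train.mean(numeric_only=True), inplace=True)
test.fillna(test.mean(numeric_only=True), inplace=True)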

train['IsFemale'] = (train['Sex'] == 'female').astype(int)
test['IsFemale'] = (test['Sex'] == 'female').astype(int)


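#The formula refers to column names of the DataFrame passed as data:
#Survived is the dependent variable; Pclass, IsFemale and Age are the predictors
results = smf.ols('Survived ~ Pclass + IsFemale + Age', data=train).fit()
print(results.params)

#Predicted values for the first 5 passengers, rounded to 0 / 1
print(results.predict(train[:5]).round())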

## References
1. J. VanderPlas, Python Data Science Handbook, Ch. 5: https://jakevdp.github.io/PythonDataScienceHandbook/index.html
2. https://stackabuse.com/k-means-clustering-with-scikit-learn/
3. https://www.datacamp.com/community/tutorials/k-means-clustering-python
4. Wes McKinney, Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2nd Edition, O’Reilly
5. https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc
6. https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.html