## Review: what we did in Week 3: Amazon data
* Read Amazon.csv
* Get to know the data
* Create a smaller subset of the data
## [Jump to Week 4 material](#thisWeek)

In [1]:
# imports and specifications
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### read Amazon.csv

In [2]:
amazon = pd.read_csv('/Users/juandherrera/Google Drive/017_Machine Learning/ml/week04/Amazon.csv')

### get to know the data

In [3]:
print("amazon is:", type(amazon))
print("amazon has", amazon.shape[0], "rows and", amazon.shape[1], "columns", "\n")
print("the data types for each of the columns in amazon:")
print(amazon.dtypes, "\n")
print("the first 10 rows in amazon:")
print(amazon.head(10))

amazon is: <class 'pandas.core.frame.DataFrame'>
amazon has 455000 rows and 13 columns 

the data types for each of the columns in amazon:
Unnamed: 0                  int64
Id                          int64
ProductId                  object
UserId                     object
ProfileName                object
HelpfulnessNumerator        int64
HelpfulnessDenominator      int64
Score                       int64
Time                        int64
Summary                    object
Text                       object
helpScore                 float64
helpful                      bool
dtype: object 

the first 10 rows in amazon:
   Unnamed: 0      Id   ProductId          UserId       ProfileName  \
0      138806  138807  B000E63LME  A1CQGW1AOD0LF2  Alena K. "Alena"   
1      469680  469681  B004ZIH4KM  A37S7U1OX2MCWI        Becky Cole   
2      238202  238203  B003ZXE9QA  A2OM6G73E64EQ9              jeff   
3      485307  485308  B001RVFERK  A25W349EE97NBK          Tangent4   
4      375283  3752

### create an ndarray for `L`

In [4]:
L = amazon["helpful"]
print(type(L))
print(type(L.values))
print(L.shape)

<class 'pandas.core.series.Series'>
<class 'numpy.ndarray'>
(455000,)


### create an ndarray for `X`
Use only "Score" and "Time" as features, for now.

In [5]:
X = amazon[["Score", "Time"]]
print(type(X))
print(type(X.values))
print(X.shape)

<class 'pandas.core.frame.DataFrame'>
<class 'numpy.ndarray'>
(455000, 2)
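
The headings say ndarray, but `L` and `X` as assigned are still a pandas Series and DataFrame; `.values` only shows the array underneath. A minimal sketch of converting explicitly, assuming a made-up three-row frame (`toy`, `L_arr`, and `X_arr` are hypothetical names, not part of the course code):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the amazon DataFrame, with the same three columns
toy = pd.DataFrame({
    "Score": [5, 1, 4],
    "Time": [1311120000, 1271290000, 1332720000],
    "helpful": [False, True, False],
})

# .to_numpy() returns the underlying ndarray explicitly
L_arr = toy["helpful"].to_numpy()
X_arr = toy[["Score", "Time"]].to_numpy()

print(type(L_arr), L_arr.shape)
print(type(X_arr), X_arr.shape)
```

scikit-learn accepts either form, so the notebook's `sgd.fit(X, L)` works without this conversion.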


## <a name='thisWeek'></a>Week 4: fit linear classifier using gradient descent and assess the fit of the model

### using the `SGDClassifier` class in `linear_model`, fit the model according to given training data

In [6]:
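from sklearn import linear_model
# "squared_loss" is the name in the scikit-learn version used here;
# scikit-learn 1.0 and later call the same loss "squared_error"
sgd = linear_model.SGDClassifier(loss="squared_loss")
sgd.fit(X, L)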



SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='squared_loss',
       max_iter=None, n_iter=None, n_iter_no_change=5, n_jobs=None,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       tol=None, validation_fraction=0.1, verbose=0, warm_start=False)
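
To see what a fitted `SGDClassifier` exposes, here is a small sketch on synthetic data rather than the Amazon data. Everything in it is an assumption made for illustration: the two Gaussian blobs, the names `X_toy`, `L_toy`, and `sgd_toy`, and the use of the default hinge loss instead of the squared loss above (the squared loss is spelled differently across scikit-learn versions).

```python
import numpy as np
from sklearn import linear_model

# Synthetic, well-separated two-class data (not the Amazon data)
rng = np.random.default_rng(0)
n = 200
X_pos = rng.normal(loc=2.0, scale=1.0, size=(n, 2))
X_neg = rng.normal(loc=-2.0, scale=1.0, size=(n, 2))
X_toy = np.vstack([X_pos, X_neg])
L_toy = np.array([True] * n + [False] * n)

# Default (hinge) loss; random_state fixes the shuffling for repeatability
sgd_toy = linear_model.SGDClassifier(max_iter=1000, tol=1e-3, random_state=0)
sgd_toy.fit(X_toy, L_toy)

# The learned linear boundary is coef_ . x + intercept_ = 0
print(sgd_toy.coef_, sgd_toy.intercept_)

# Proportion of training rows classified correctly
toy_accuracy = (sgd_toy.predict(X_toy) == L_toy).mean()
print(toy_accuracy)
```

Both features here are on the same small scale, which is the setting in which stochastic gradient descent tends to behave well.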

In [7]:
# number & proportion of accurate predictions
predictions = sgd.predict(X)
n_correct = (predictions == L.values).sum()
print(n_correct)
print(n_correct / L.shape[0])  # L.shape is a tuple, so divide by its first element

33235
0.07304395604395604


### how well did we do? compare the model's predictions `Y` to the labels `L`
We'll start with the first few measures in Flach, p. 57.

In [8]:
import my_measures

sgd_pm = my_measures.BinaryClassificationPerformance(sgd.predict(X), L, 'sgd')
sgd_pm.compute_measures()
print(sgd_pm.performance_measures)

{'Pos': 33235, 'Neg': 421765, 'TP': 33235, 'TN': 0, 'FP': 421765, 'FN': 0, 'Accuracy': 0.07304395604395604, 'Precision': 0.07304395604395604, 'Recall': 1.0, 'desc': 'sgd'}
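
The numbers above (TN = 0, FN = 0, Recall = 1.0) say the model predicted `True` for every row. The internals of `my_measures` are not shown here, so the following is only a sketch of how such measures can be computed from two boolean arrays; `binary_measures` is a hypothetical helper, not the course module. The toy input mimics the all-positive situation.

```python
import numpy as np

def binary_measures(predictions, labels):
    """Confusion-matrix counts and a few derived measures for boolean inputs."""
    predictions = np.asarray(predictions, dtype=bool)
    labels = np.asarray(labels, dtype=bool)
    tp = int(np.sum(predictions & labels))    # predicted True, actually True
    tn = int(np.sum(~predictions & ~labels))  # predicted False, actually False
    fp = int(np.sum(predictions & ~labels))   # predicted True, actually False
    fn = int(np.sum(~predictions & labels))   # predicted False, actually True
    total = tp + tn + fp + fn
    return {
        "Pos": tp + fn, "Neg": tn + fp,
        "TP": tp, "TN": tn, "FP": fp, "FN": fn,
        "Accuracy": (tp + tn) / total,
        "Precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "Recall": tp / (tp + fn) if (tp + fn) else float("nan"),
    }

# A classifier that says True for everything, on labels that are mostly False
toy_measures = binary_measures([True, True, True, True],
                               [True, False, False, False])
print(toy_measures)
```

With every prediction positive, accuracy and precision both collapse to the share of positive labels, which matches the pattern in the output above (33235 / 455000).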


## Normalization

*[Normalization](https://scikit-learn.org/stable/modules/preprocessing.html#normalization) is the process of scaling individual samples to have unit norm.*

In [9]:
X.describe()

| | Score | Time |
|---|---|---|
| count | 455000.0 | 455000.0 |
| mean | 4.183233 | 1296260000.0 |
| std | 1.310769 | 48009700.0 |
| min | 1.0 | 939340800.0 |
| 25% | 4.0 | 1271290000.0 |
| 50% | 5.0 | 1311120000.0 |
| 75% | 5.0 | 1332720000.0 |
| max | 5.0 | 1351210000.0 |


In [10]:
from sklearn import preprocessing
X_normalized = preprocessing.normalize(X)

In [11]:
pd.DataFrame(X_normalized).describe()

| | 0 | 1 |
|---|---|---|
| count | 455000.0 | 455000.0 |
| mean | 3.234304e-09 | 1.0 |
| std | 1.02596e-09 | 0.0 |
| min | 7.400776e-10 | 1.0 |
| 25% | 2.994586e-09 | 1.0 |
| 50% | 3.74614e-09 | 1.0 |
| 75% | 3.860856e-09 | 1.0 |
| max | 5.322882e-09 | 1.0 |
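
Column 1 comes out as exactly 1.0 because `preprocessing.normalize`, with its defaults, divides each row by that row's own L2 norm. `Time` is on the order of 10^9 while `Score` is at most 5, so the norm of every row is essentially its `Time` value. A minimal sketch that reproduces this by hand, assuming two made-up rows with magnitudes taken from the `describe()` output above (`rows`, `manual`, and `normalized` are illustrative names):

```python
import numpy as np
from sklearn import preprocessing

# Two made-up (Score, Time) rows on the same scale as the real data
rows = np.array([[5.0, 1311120000.0],
                 [1.0, 1271290000.0]])

# Row-wise L2 normalization by hand: divide each row by its own length
row_norms = np.linalg.norm(rows, axis=1, keepdims=True)
manual = rows / row_norms

# The same operation through scikit-learn (default norm="l2", axis=1)
normalized = preprocessing.normalize(rows)

print(normalized)
print(np.linalg.norm(normalized, axis=1))  # every row now has unit length
```

After this step `Score` survives only as a value near 10^-9, and `Time` carries no variation at all.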
