#  Machine Learning Project

## Part I. Exploratory data analysis (correlation analysis)

### Check the data quality first and perform data preparation tasks


In [None]:
# import packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#regression packages
import sklearn.linear_model as lm

#model validation
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

#model evaluation
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score

#f_regression (feature selection)
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import SelectKBest

# recursive feature selection (feature selection)
from sklearn.feature_selection import RFE

import statsmodels.api as sm
from statsmodels.formula.api import ols

#ignore warning
import warnings
warnings.filterwarnings("ignore")

In [None]:
#load and display the data
df = pd.read_csv("movie_metadata.csv")
df.head(2)


In [None]:
# check column names and number of unique values
df.nunique()

In [None]:
# identify categorical and convert into numerical
# create dummy variable columns for color
df = pd.get_dummies(df, columns=['color'])
df.head(2)

In [None]:
#check data types & missing values
df.info()

In [None]:
# check the number of missing values for each column
df.isnull().sum()

In [None]:
#handling missing value: 
#remove the rows with any missing value
df = df.dropna()
df.info()

In [None]:
# flag duplicated rows for review before removing them (keep=False marks every copy)
mask = df.duplicated(keep=False)
print(mask)

In [None]:
# display those duplicate rows for review
df[mask]

In [None]:
# find out how many 
len(df[mask])

In [None]:
# now drop those duplicated
dfnew = df.drop_duplicates(keep="first")
dfnew.head(2)

In [None]:
# remove unnecessary columns (movie_title, director_name, actor_1_name, actor_2_name, actor_3_name, plot_keywords, country, movie_imdb_link, language, content_rating, title_year, genres)
dfnew = dfnew.drop(['movie_title', 'director_name', 'actor_1_name','actor_2_name','actor_3_name','country','plot_keywords','movie_imdb_link','language','content_rating','title_year','genres'], axis=1)
dfnew.head(2)

In [None]:
# a statistical summary
dfnew.describe()

### Perform correlation analysis and discuss the results. What variables are correlated to imdb_score? How are some key variables correlated to each other?

In [None]:
# correlation analysis
dfnew.corr()['imdb_score']

In [None]:
# correlation heatmap
plt.figure(figsize=(12,12))
sns.heatmap(dfnew.corr(), annot=True);

In [None]:
# correlation plot(scatter plot)
sns.pairplot(dfnew) 

In [None]:
# scatter plot for 'gross' and 'imdb_score'
dfnew.plot(x="gross", y="imdb_score", kind="scatter", title="Relationship between gross and imdb_score", grid=True, s=1)

In [None]:
# scatter plot for 'num_voted_users' and 'imdb_score'
dfnew.plot(x="num_voted_users", y="imdb_score", kind="scatter", title="Relationship between num_voted_users and imdb_score", grid=True, s=1)

In [None]:
# scatter plot for 'movie_facebook_likes' and 'imdb_score'
dfnew.plot(x="movie_facebook_likes", y="imdb_score", kind="scatter", title="Relationship between movie_facebook_likes and imdb_score", grid=True, s=1)

In [None]:
# lmplot for num_voted_users and imdb_score (plot data and regression model)
sns.lmplot(x='num_voted_users', y='imdb_score', data=dfnew)

In [None]:
# lmplot for duration and imdb_score (plot data and regression model)
sns.lmplot(x='duration', y='imdb_score', data=dfnew)

In [None]:
# lmplot for num_critic_for_reviews and imdb_score (plot data and regression model)
sns.lmplot(x='num_critic_for_reviews', y='imdb_score', data=dfnew)

In [None]:
# lmplot for num_user_for_reviews and imdb_score (plot data and regression model)
sns.lmplot(x='num_user_for_reviews', y='imdb_score', data=dfnew)

* ***The results of the correlation analysis***

* imdb_score has a **positive correlation** with num_critic_for_reviews, duration, director_facebook_likes, actor_3_facebook_likes, actor_1_facebook_likes, gross, num_voted_users, cast_total_facebook_likes, num_user_for_reviews, budget, actor_2_facebook_likes, aspect_ratio, color_ Black and White, and movie_facebook_likes.
* imdb_score has a **high positive correlation** with **num_voted_users, duration, num_critic_for_reviews, and num_user_for_reviews**.
* imdb_score has a **negative correlation** with **facenumber_in_poster and color_Color**.
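
A ranked view can make the list above easier to read. The sketch below sorts features by the absolute value of their correlation with the target. The `toy` frame and its values are made up for illustration; only the column names are borrowed from the notebook.

```python
import pandas as pd

# Hypothetical toy data: column names mirror the notebook, values are invented.
toy = pd.DataFrame({
    "imdb_score":           [5.0, 6.0, 6.5, 7.0, 8.0],
    "num_voted_users":      [100, 300, 400, 900, 2000],
    "facenumber_in_poster": [4, 3, 3, 1, 0],
    "duration":             [90, 95, 120, 110, 150],
})

# Pearson correlation of each feature with the target,
# ordered by strength regardless of sign.
corr_with_score = (
    toy.corr()["imdb_score"]
    .drop("imdb_score")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_score)
```

In the notebook itself, the same chain could be applied to `dfnew` instead of `toy`.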

In [None]:
# check the correlation between X variables 
# drop imdb_score  column and only keep the X variables
df1=dfnew.drop('imdb_score', axis=1)
df1.head(2)

In [None]:
# check the correlation
df1.corr()

In [None]:
# correlation plot
plt.figure(figsize=(8,8))
sns.heatmap(df1.corr(), vmax=.8, square=True,annot=True, fmt=".1f")

***The results of the correlation analysis***

* **num_critic_for_reviews** has a high positive correlation with **gross, num_voted_users, num_user_for_reviews, and movie_facebook_likes**.
* **actor_3_facebook_likes** has a high positive correlation with **cast_total_facebook_likes and actor_2_facebook_likes**.
* **actor_1_facebook_likes** has a high positive correlation with **cast_total_facebook_likes**.

* **facenumber_in_poster** has a negative correlation with **num_critic_for_reviews**.

## Part II. Regression

**Normalize the data**
* Apply a normalizer to our data and run regression analysis
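
Note that scikit-learn's `Normalizer` rescales each row (sample) to unit norm, whereas `StandardScaler` rescales each column (feature). The sketch below contrasts the two on a made-up matrix; `X_demo` is a hypothetical stand-in, not the movie data.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Made-up matrix: two columns on very different scales.
X_demo = np.array([[1.0, 1000.0],
                   [2.0, 3000.0],
                   [3.0, 2000.0]])

# Normalizer works row by row: every sample ends up with L2 norm 1.
row_scaled = Normalizer().fit_transform(X_demo)
row_norms = np.linalg.norm(row_scaled, axis=1)

# StandardScaler works column by column: every feature ends up
# with mean 0 and unit variance.
col_scaled = StandardScaler().fit_transform(X_demo)

print(row_norms)
print(col_scaled.mean(axis=0), col_scaled.std(axis=0))
```

If the goal is to put features on a comparable scale before regression, a column-wise scaler may be the closer fit; which one to use is a design choice for the analysis.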

In [None]:
y = dfnew['imdb_score'] 
X = dfnew.drop(['imdb_score'], axis =1)

In [None]:
# data is not scaled ... some columns have wide scales
plt.boxplot(X)

In [None]:
X.head(2)

* **gross and budget** do not appear to be good predictors of imdb_score, so let's remove them.

In [None]:
from sklearn.preprocessing import Normalizer

y =dfnew['imdb_score']  
X =dfnew.drop(['imdb_score','gross','budget'], axis =1)   # remove gross and budget

# Normalizer rescales each row (sample) to unit norm
# note: the models below are fit on X, not on normalizedX
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
normalizedX

In [None]:
# boxplot after normalization: each row has unit norm, so every value lies in [-1, 1]
plt.boxplot(normalizedX);

### Build regression models using at least three different regression algorithms, including Lasso. The Y value is imdb_score. 

#### Model #1(full model)

**Model Validation: Split validation**

In [None]:
# split validation (70% training & 30% testing data)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

In [None]:
# let's double check 

print(len(dfnew))
print(len(dfnew) * 0.7)        # 70% of the original data
print(len(dfnew) * 0.3)        # 30% of the original data

**Model Building**

In [None]:
# build full model using all variables
model1 = lm.LinearRegression()
model1.fit(X_train, y_train)   
model1_y = model1.predict(X_test)# generate predicted y for model evaluation

In [None]:
# this is regression so it has coefficients and y-intercept

print('Coefficients: ', model1.coef_)
print("y-intercept ", model1.intercept_)

In [None]:
pd.DataFrame(list(zip(X.columns, np.transpose(model1.coef_)))).round(2)

**Model Evaluation**

In [None]:
print("mean square error: ", mean_squared_error(y_test, model1_y))
print("variance or r-squared: ", explained_variance_score(y_test, model1_y))

In [None]:
plt.subplots()
plt.scatter(y_test, model1_y)       # showing actual y as X-axis and predicted y as Y-axis
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=4)   #dotted line represents perfect prediction (actual = predicted)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()

* Model 1 doesn't seem to be very good.
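
The evaluation cells print `explained_variance_score` under the label "variance or r-squared". The two metrics agree only when the residuals average to zero. The sketch below computes MSE, R-squared, and explained variance by hand on made-up numbers to show how they can differ; the arrays are illustrative, not model output.

```python
import numpy as np

# Made-up example: every prediction is exactly 0.5 too high.
y_true = np.array([6.0, 7.0, 5.0, 8.0])
y_pred = np.array([6.5, 7.5, 5.5, 8.5])

residuals = y_true - y_pred

# Mean squared error: average squared residual.
mse = np.mean(residuals ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares).
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Explained variance: 1 - Var(residuals) / Var(y); a constant bias is ignored.
explained_var = 1 - np.var(residuals) / np.var(y_true)

print(mse, r2, explained_var)
```

Here the constant bias lowers R-squared but leaves explained variance at its maximum, which is why `sklearn.metrics.r2_score` may be the safer choice when the number is reported as r-squared.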

#### Model #2(Lasso regression (Regularization))

In [None]:
#Fit the model
model2 = lm.Lasso(alpha=0.1)
model2.fit(X_train, y_train)   
model2_y = model2.predict(X_test)
           

In [None]:
print('Coefficients: ', model2.coef_)
print("y-intercept ", model2.intercept_)

In [None]:
pd.DataFrame(list(zip(X.columns, np.transpose(model2.coef_)))).round(2)

In [None]:
print("mean square error: ", mean_squared_error(y_test, model2_y))
print("variance or r-squared: ", explained_variance_score(y_test, model2_y))

#### Model #3(f_Regression, k=2)

In [None]:
# select only 2 X variables
X_new = SelectKBest(f_regression, k=2).fit_transform(X, y)
X_new

In [None]:
# what are those two columns?
selector = SelectKBest(f_regression, k=2).fit(X, y)
idxs_selected = selector.get_support(indices=True)
print(idxs_selected)

In [None]:
# show the selected X variables
X.columns[selector.get_support()]

f_regression determines that **duration** and **num_voted_users** are the two most important predictors.
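
To see why particular columns are chosen, it can help to look at the F-scores themselves. The sketch below runs `SelectKBest` with `f_regression` on synthetic data; the column names `signal_a`, `signal_b`, and `noise` are hypothetical and the data is randomly generated.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: two informative columns and one pure-noise column.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "signal_a": rng.normal(size=n),
    "signal_b": rng.normal(size=n),
    "noise":    rng.normal(size=n),
})
target = 3 * demo["signal_a"] - 2 * demo["signal_b"] + rng.normal(scale=0.1, size=n)

# f_regression scores each feature on its own (univariate F-test).
selector = SelectKBest(f_regression, k=2).fit(demo, target)
scores = pd.Series(selector.scores_, index=demo.columns).sort_values(ascending=False)
selected = list(demo.columns[selector.get_support()])

print(scores)
print(selected)
```

Because each feature is scored in isolation, two strongly correlated predictors can both rank highly even if they carry overlapping information.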

In [None]:
# split validation (using X_new)

X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.3, random_state=0)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

In [None]:
# Model Building

model3 = lm.LinearRegression()
model3.fit(X_train, y_train)
model3_y = model3.predict(X_test)

# Model Evaluation
print("mean square error: ", mean_squared_error(y_test, model3_y))
print("variance or r-squared: ", explained_variance_score(y_test, model3_y))

#### Model #4(f_Regression, k=3)

In [None]:
X_newer = SelectKBest(f_regression, k=3).fit_transform(X, y)
X_newer

In [None]:
# what are those three columns?

selector = SelectKBest(f_regression, k=3).fit(X, y)
idxs_selected = selector.get_support(indices=True)
print(idxs_selected)

In [None]:
# show the selected X variables
X.columns[selector.get_support()]

f_regression determines that **num_critic_for_reviews, duration, and num_voted_users** are the three most important predictors.

In [None]:
# split validation (using X_newer)
X_train, X_test, y_train, y_test = train_test_split(X_newer, y, test_size=0.3, random_state=0)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

In [None]:
# Model Building
model4 = lm.LinearRegression()
model4.fit(X_train, y_train)
model4_y = model4.predict(X_test)

# Model Evaluation
print("mean square error: ", mean_squared_error(y_test, model4_y))
print("variance or r-squared: ", explained_variance_score(y_test, model4_y))

#### Model #5(Recursive Feature Selection)

In [None]:
lr = lm.LinearRegression()
rfe = RFE(lr, n_features_to_select=2)
rfe_y = rfe.fit(X,y)

print("Features sorted by their rank:")
print(sorted(zip([x for x in rfe.ranking_], X.columns)))

RFE determines that **'color_ Black and White'** and **'color_Color'** are the two most important predictors.
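
The two dummy columns always sum to 1, so together with an intercept they are perfectly collinear. This may be one reason a model built only on them behaves unstably, although that interpretation is an assumption rather than something verified against this notebook's run. The sketch below shows the collinearity on a made-up `colors` frame and the `drop_first=True` option that avoids it.

```python
import numpy as np
import pandas as pd

# Made-up column using the same two category labels as the dataset.
colors = pd.DataFrame({"color": ["Color", " Black and White", "Color", "Color"]})

# Keeping both dummies: each row has exactly one 1, so the columns sum to 1.
both = pd.get_dummies(colors, columns=["color"]).astype(int)
row_sums = both.sum(axis=1)

# With an intercept column, the design matrix loses a rank.
design = np.column_stack([np.ones(len(both)), both.to_numpy()])
rank = np.linalg.matrix_rank(design)

# drop_first=True keeps a single dummy and removes the redundancy.
one = pd.get_dummies(colors, columns=["color"], drop_first=True).astype(int)

print(row_sums.tolist(), rank, list(one.columns))
```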

In [None]:
# Choose two variables as X (color_ Black and White and color_Color) and develop a multiple linear regression model (model5).
y = dfnew['imdb_score'] 
X = dfnew[['color_ Black and White','color_Color']]

In [None]:
# split validation (70% training & 30% testing data)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

In [None]:
# build a multiple regression model below

model5 = lm.LinearRegression()
model5.fit(X_train, y_train)   
model5_y = model5.predict(X_test)

In [None]:
print('Coefficients: ', model5.coef_)
print("y-intercept ", model5.intercept_)

In [None]:
pd.DataFrame(list(zip(X.columns, np.transpose(model5.coef_)))).round(2)

In [None]:
# model evaluation
print("mean square error: ", mean_squared_error(y_test, model5_y))
print("variance or r-squared: ", explained_variance_score(y_test, model5_y))

### Evaluate the models.

***We built five models:***

* MODEL 1 (full model): mean squared error **0.7047386578949048**, explained variance (r-squared) **0.35856948126783417**.

* MODEL 2 (Lasso regression): mean squared error **0.7252891444939343**, explained variance (r-squared) **0.3398899983208514**.

* MODEL 3 (f_Regression, k=2): mean squared error **0.7830449512437508**, explained variance (r-squared) **0.287430567797212**.

* MODEL 4 (f_Regression, k=3): mean squared error **0.7777377932878307**, explained variance (r-squared) **0.292239918741508**.

* MODEL 5 (Recursive Feature Selection): mean squared error **3.6977823246083156e+23**, explained variance (r-squared) **-3.3647108281847255e+23**.

### What is your best model? What is the accuracy?

* The best model is the **Lasso model** with 2 variables (**duration** and **facenumber_in_poster**).
* According to the results above, the full model has the smallest MSE and the highest r-squared. It is the most accurate, but it is complex because it uses many X variables.
* The Lasso model with 2 X variables (**duration** and **facenumber_in_poster**) ranks second. With an r-squared of 0.34 and an MSE of 0.73, it is accurate enough while remaining simple and practical.

## Part III. Classification

### The goal is to build a classification model to predict whether a movie is good or bad. You need to create a new “categorical” column from imdb_score in order to build classification models. Create the column by “converting” the imdb_score into 2 categories (or classes): 1~5 and 6~10, which represent bad (or 0) and good (or 1), respectively.

In [None]:
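# convert imdb_score to a good or bad label
# note: the rule below marks a movie as good only when imdb_score > 6, so a score of exactly 6.0 is labeled bad
# https://stackoverflow.com/questions/43232753/how-to-change-the-values-of-a-column-based-on-two-conditions-in-python

dfnew['movie_grade'] = 0 # bad
dfnew.loc[dfnew['imdb_score'] > 6,'movie_grade'] = 1 # good (mask built from dfnew itself, not the original df)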
dfnew.head(2)

### Exclude imdb_score in X variables since the column imdb_score is basically same as the newly created binary column.

In [None]:
# remove imdb_score column
df2=dfnew.drop('imdb_score', axis=1)
df2.head()

### Not all variables need to be used as X variables, but it is important to include all the relevant variables as X to increase the model accuracy.

In [None]:
#check data types
df2.info()

In [None]:
# remove irrelevant columns(duration,facenumber_in_poster,color_ Black and White,color_Color,budget )
df2 = df2.drop(['duration','facenumber_in_poster','color_ Black and White','color_Color','budget'], axis=1)
df2.head()

In [None]:
# calculate the number of good movies and bad movies
ax=df2['movie_grade'].value_counts().plot(kind='bar',rot=45)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.005))

* As we can see from the chart above, there are **2592** good movies and **1132** bad movies.
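
Because the classes are imbalanced, a useful reference point is the accuracy of always predicting the majority class. The short calculation below uses only the two counts reported above; any classifier accuracy in the following sections can be read against this baseline.

```python
# Class counts reported in the bar chart above.
good, bad = 2592, 1132
total = good + bad

# Accuracy of a trivial model that always predicts the larger class.
baseline_accuracy = max(good, bad) / total
print(f"majority-class baseline: {baseline_accuracy:.3f}")
```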

### It is important that you use at least three different classification algorithms we have learned and evaluate model quality.


In [None]:
#Import Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Classifiers（algorithm for classification）
#import decisiontreeclassifier
from sklearn import tree
from sklearn.tree import export_text
from sklearn.tree import DecisionTreeClassifier
#import logisticregression classifier
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
#import knn classifier
from sklearn.neighbors import KNeighborsClassifier
#import randomforest classifier
from sklearn.ensemble import RandomForestClassifier

#for validating your classification model
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import roc_auc_score

# feature selection
from sklearn.feature_selection import RFE
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# grid search
from sklearn.model_selection import GridSearchCV

#### Model #1 KNN

##### Model Building & Validation

In [None]:
# declare X variables and y variable
y = df2['movie_grade'] 
X = df2.drop(['movie_grade'], axis =1)
X.head()

##### Search for the optimal k value (GridSearch)

In [None]:
# evaluate the model by splitting into train and test sets & develop knn model (name it as knn)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)


# Initialize knn Classifier() ... name your decision model "knn"
knn=KNeighborsClassifier()

# Train a knn model
knn.fit(X_train, y_train) 

In [None]:
#create a dictionary of all values we want to test for n_neighbors
params_knn = {'n_neighbors': np.arange(1, 25)}  # candidate k values 1 to 24 (np.arange excludes the upper bound)

#use gridsearch to test all values for n_neighbors
knn_gs = GridSearchCV(knn, params_knn, cv=5)

#fit model to training data
knn_gs.fit(X_train, y_train)

In [None]:
#save best model
knn_best = knn_gs.best_estimator_

#check best n_neighbors value
print(knn_gs.best_score_)
print(knn_gs.best_params_)
print(knn_gs.best_estimator_)

* From the result above, we can see that the **optimal k value is 21, so we set n_neighbors to 21.**

In [None]:
# evaluate the model by splitting into train and test sets & develop knn model (name it as knn)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)



# Initialize knn Classifier() ... name your decision model "knn"
knn=KNeighborsClassifier(n_neighbors=21)

# Train a knn model
knn.fit(X_train, y_train) 

In [None]:
#Model evaluation
# http://scikit-learn.org/stable/modules/model_evaluation.html

print(metrics.accuracy_score(y_test, knn.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, knn.predict(X_test))) 
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, knn.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.roc_auc_score(y_test, knn.predict(X_test)))

* The knn model is 68.5% accurate on the test set, so we expect it to be about **69% accurate** when applied in a real-world situation. The roc_auc_score is **0.53**.
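
The evaluation cell passes hard class predictions to `roc_auc_score`. With 0/1 predictions the result reduces to balanced accuracy; passing predicted probabilities uses the full ranking instead. The sketch below illustrates the difference on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, and `knn_demo` are hypothetical and the data is not the movie dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem for illustration only.
X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

knn_demo = KNeighborsClassifier(n_neighbors=21).fit(X_tr, y_tr)

# AUC from hard labels: equivalent to balanced accuracy.
auc_from_labels = roc_auc_score(y_te, knn_demo.predict(X_te))
balanced_acc = balanced_accuracy_score(y_te, knn_demo.predict(X_te))

# AUC from class-1 probabilities: uses the ranking of all test samples.
auc_from_scores = roc_auc_score(y_te, knn_demo.predict_proba(X_te)[:, 1])

print(auc_from_labels, balanced_acc, auc_from_scores)
```

The same substitution, `predict_proba(X_test)[:, 1]` in place of `predict(X_test)`, would apply to the other classifiers in this part, which would change the reported ROC scores.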

#### Model #2 Logistic regression using Recursive Feature Selection (RFE)

In [None]:
#import decisiontreeclassifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from IPython.display import SVG
#from graphviz import Source
from IPython.display import display
#import logisticregression classifier
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
#import knn classifier
from sklearn.neighbors import KNeighborsClassifier

#for validating your classification model
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split, GridSearchCV 
from sklearn import metrics
from sklearn.metrics import roc_curve, auc

# feature selection
from sklearn.feature_selection import RFE
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

#pip install scikit-plot (optional)
import scikitplot as skplt

import warnings
warnings.filterwarnings("ignore")

In [None]:
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=5)  # ask for the five best attributes (keyword argument required in recent scikit-learn)
rfe = rfe.fit(X, y)
# summarize the selection of the attributes
print((rfe.support_))
print((rfe.ranking_))

In [None]:
# RFE rank of each feature (rank 1 = selected)
pd.DataFrame({'feature':X.columns, 'importance':rfe.ranking_})

In [None]:
#select 5 most significant features only 
X_logistic = df2[['num_critic_for_reviews', 'actor_1_facebook_likes', 'cast_total_facebook_likes', 'num_user_for_reviews', 'actor_2_facebook_likes']]
print(X_logistic.head())

In [None]:
## develop logistic regression model with X_logistic (only 5 predictors or independent variables)
# evaluate the model by splitting into train and test sets and build a logistic regression model
# name it as "lr"
X_train, X_test, y_train, y_test = train_test_split(X_logistic, y, test_size=0.3, random_state=0)
lr = LogisticRegression(solver='lbfgs', max_iter=500)  # max_iter is the maximum number of solver iterations
lr.fit(X_train, y_train)

#Model evaluation
print(metrics.accuracy_score(y_test, lr.predict(X_test)))
print(metrics.confusion_matrix(y_test, lr.predict(X_test)))
print(metrics.classification_report(y_test, lr.predict(X_test)))
print(metrics.roc_auc_score(y_test, lr.predict(X_test)))

The logistic regression model is **70.8%** accurate on the test set, so we expect it to be about **71%** accurate when applied in a real-world situation. The roc_auc_score is **0.52**.

#### Model #3 Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
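clf = RandomForestClassifier(n_estimators=20)    # build 20 decision trees; no random_state is set, so results vary between runs
clf = clf.fit(X_train, y_train)                  # X_train is still the 5-feature split created for logistic regression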
clf.score(X_test, y_test)

In [None]:
# generate evaluation metrics
print(metrics.accuracy_score(y_test, clf.predict(X_test))) #overall accuracy
print(metrics.confusion_matrix(y_test, clf.predict(X_test)))
print(metrics.classification_report(y_test, clf.predict(X_test)))
print(metrics.roc_auc_score(y_test, clf.predict(X_test)))

* The random forest model is 75.6% accurate on the test set, so we expect it to be about **76% accurate** when applied in a real-world situation. The roc_auc_score is **0.68**.

#### Model #4 Decision Tree Model using SelectKBest

In [None]:
X_new = SelectKBest(chi2, k=3).fit_transform(X, y)
print(X_new)

In [None]:
# this will help you identify the column indexes (and names)
selector = SelectKBest(chi2, k=3).fit(X, y)
idxs_selected = selector.get_support(indices=True)
print(idxs_selected)

In [None]:
#identify the column  names
X.columns[selector.get_support()]

In [None]:
# Build a decision tree model with those three features ... Split validation:train (70%) and test sets (30%)

# declare X variables and y variable
y = df2['movie_grade'] 
X_new = df2[['gross', 'num_voted_users','movie_facebook_likes']]
X_new.head()

# split validation
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.3, random_state=0)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

# Initialize DecisionTreeClassifier() ... name your decision model "dt"
dt = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)

# Train a decision tree model
dt.fit(X_train, y_train) 

dt_y = dt.predict(X_test)


#Model evaluation
# http://scikit-learn.org/stable/modules/model_evaluation.html
print(metrics.accuracy_score(y_test, dt.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, dt.predict(X_test))) 
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, dt.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.roc_auc_score(y_test, dt.predict(X_test)))

* The decision tree model is 74.1% accurate on the test set, so we expect it to be about **74% accurate** when applied in a real-world situation. The roc_auc_score is **0.65**.

#### Model #5 GradientBoostingClassifier

In [None]:
# import advanced algorithms
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC

In [None]:
# split validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# initialize 
gb = GradientBoostingClassifier(n_estimators=100, random_state=0)

# fit the model
gb.fit(X_train, y_train)

In [None]:
#Model evaluation
# http://scikit-learn.org/stable/modules/model_evaluation.html
print(metrics.accuracy_score(y_test, gb.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, gb.predict(X_test))) 
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, gb.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.roc_auc_score(y_test, gb.predict(X_test)))

* The GradientBoosting model is 75.8% accurate on the test set, so we expect it to be about **76% accurate** when applied in a real-world situation. The roc_auc_score is **0.67**.

#### Model #6 Support Vector Machine (SVM)

In [None]:
# initialize 
svm = SVC(gamma='auto', probability=True)
# fit the model
svm.fit(X_train, y_train)

In [None]:
#Model evaluation
# http://scikit-learn.org/stable/modules/model_evaluation.html
print(metrics.accuracy_score(y_test, svm.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, svm.predict(X_test))) 
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, svm.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.roc_auc_score(y_test, svm.predict(X_test)))

* The Support Vector Machine model is 70.3% accurate on the test set, so we expect it to be about **70% accurate** when applied in a real-world situation. The roc_auc_score is **0.50**.

#### Model #7 Neural Network

In [None]:
nn = MLPClassifier(solver='lbfgs', max_iter=500,random_state=0)
nn.fit(X_train, y_train)

In [None]:
#Model evaluation
# http://scikit-learn.org/stable/modules/model_evaluation.html
print(metrics.accuracy_score(y_test, nn.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, nn.predict(X_test))) 
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, nn.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.roc_auc_score(y_test, nn.predict(X_test)))

* The Neural Network model is 70.2% accurate on the test set, so we expect it to be about **70% accurate** when applied in a real-world situation. The roc_auc_score is **0.5**.

### What is your best classification model? What is the model accuracy? True positive rate? False positive rate? What is ROC score?

* The best classification model is **Random Forest Model**.  
* The model accuracy is **76%**.
* True positive rate is **0.87**. 
* False positive rate is **0.5**
* The ROC score is **0.68**.

* 679: good movies that the model correctly classifies as good (true positives);
* 163: bad movies that the model incorrectly classifies as good (false positives);
* 170: bad movies that the model correctly classifies as bad (true negatives);
* 106: good movies that the model incorrectly classifies as bad (false negatives).
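
The rates quoted above can be derived from these four counts. The sketch below does the arithmetic. It gives roughly 0.86 for the true positive rate and 0.49 for the false positive rate, close to the reported values; small gaps may come from rounding or from a different run, since the random forest is not seeded.

```python
# Confusion-matrix counts listed above.
tp, fp, tn, fn = 679, 163, 170, 106

# Share of all test movies classified correctly.
accuracy = (tp + tn) / (tp + fp + tn + fn)

# True positive rate (recall for good movies).
tpr = tp / (tp + fn)

# False positive rate (bad movies wrongly called good).
fpr = fp / (fp + tn)

print(f"accuracy={accuracy:.3f}  TPR={tpr:.3f}  FPR={fpr:.3f}")
```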


## Part IV. Clustering

### Analyze the data using K-means algorithm. Use the Elbow method to determine the optimal K value for Kmeans analysis. 

In [None]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.cluster import KMeans

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances

In [None]:
df3=dfnew.drop(['movie_grade','color_Color','color_ Black and White'], axis=1)
df3.head()

In [None]:
# variance test
df3.var()

In [None]:
#Normalize data 

X = (df3 - df3.mean()) / (df3.max() - df3.min())
X.head()

In [None]:
# variance test again

X.var()

* From the variance test result we can see that all the variables are in the same scale.
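
The transformation used here is mean normalization: subtract the column mean and divide by the column range. The sketch below, on a made-up `demo` frame, shows what it guarantees: every column has mean 0 and a range of exactly 1. The variances themselves need not be identical afterwards.

```python
import numpy as np
import pandas as pd

# Made-up frame with columns on very different scales.
demo = pd.DataFrame({
    "budget":   [1e6, 5e7, 2e8],
    "duration": [90.0, 120.0, 150.0],
})

# Same formula as the notebook: (x - mean) / (max - min), column by column.
scaled = (demo - demo.mean()) / (demo.max() - demo.min())

col_range = scaled.max() - scaled.min()
col_mean = scaled.mean()
print(col_range.tolist(), col_mean.round(12).tolist())
```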

In [None]:
# determine an optimal value of k using "Elbow" method
from scipy.spatial.distance import cdist 

K = list(range(1, 10)) 

meandistortions = []

for k in K: 
    kmeans = KMeans(n_clusters=k, random_state=1) 
    kmeans.fit(X) 
    meandistortions.append(sum(np.min(cdist(X, kmeans.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0]) 

plt.plot(K, meandistortions, 'bx-') 
plt.xlabel('k') 
plt.ylabel('Average distortion') 
plt.title('Selecting k with the Elbow Method')

* From the chart above, we decide **k=4**.
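
An alternative way to draw the elbow curve is to use the `inertia_` attribute of a fitted `KMeans` model, which is the sum of squared distances to the nearest centroid (the notebook's loop uses the mean unsquared distance instead). The sketch below applies this to synthetic blobs; `X_demo` is generated data, not the movie features, and `n_init=10` is an explicit choice made here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four well-separated groups, for illustration only.
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

ks = list(range(1, 10))
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X_demo)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

# Improvement gained by each additional cluster; the elbow is where this flattens.
drops = -np.diff(inertias)
print(dict(zip(ks[1:], np.round(drops, 1))))
```

Plotting `ks` against `inertias` with `plt.plot` gives the same kind of chart as above.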

#### Clustering analysis (k = 4): Include "random_state=0"

In [None]:
# clustering analysis using k-means
k_means=KMeans(init='k-means++',n_clusters=4,random_state=0)
k_means.fit(X)

In [None]:
# cluster labels

k_means.labels_

In [None]:
# find out cluster centers

k_means.cluster_centers_

In [None]:
# convert cluster labels to dataframe

df4 = pd.DataFrame(k_means.labels_, columns = ['cluster'])
df4.head()

In [None]:
# joining two dataframes

df3= df3.reset_index(drop=True)# reset_index after dealing with the original data
df4= df4.reset_index(drop=True)

df5 = df3.join(df4)
df5.head()

In [None]:
df5.tail()

### This is exploratory data analysis, and you need to report the movie (or cluster) “profiles” based on clustering analysis.


In [None]:
#How many observations are there in each cluster 
df5.groupby(['cluster']).size()

* In cluster 0 there are **1909** observations, in cluster 1 there are **162** observations, in cluster 2 there are **1100** observations,in cluster 3 there are **553** observations.

In [None]:
#The mean values of each cluster in terms of different variables
df5.groupby(['cluster']).mean()

In [None]:
df5.groupby(['cluster']).mean().T

## Profiling
1. cluster 0: 
* relatively low **num_critic_for_reviews**, 
* relatively low **duration**,
* relatively low **director_facebook_likes**,
* relatively low **actor_3_facebook_likes**,
* relatively low **actor_1_facebook_likes**,
* relatively low **gross**,
* relatively low **num_voted_users**,
* relatively low **cast_total_facebook_likes**, 
* relatively low **facenumber_in_poster**,
* relatively low **num_user_for_reviews**,
* relatively low **budget**,
* relatively low **actor_2_facebook_likes**,
* relatively low **imdb_score**,
* relatively low **aspect_ratio**,
* relatively low **movie_facebook_likes**


2. cluster 1: 
* relatively high **num_critic_for_reviews**, 
* high **duration**,
* high **director_facebook_likes**,
* relatively high **actor_3_facebook_likes**,
* high **actor_1_facebook_likes**,
* relatively high **gross**,
* relatively high **num_voted_users**,
* high **cast_total_facebook_likes**,
* low **facenumber_in_poster**,
* relatively high **num_user_for_reviews**,
* relatively high **budget**,
* relatively high **actor_2_facebook_likes**,
* high **imdb_score**,
* relatively high **aspect_ratio**,
* relatively high **movie_facebook_likes**.


3. cluster 2: 
* low **num_critic_for_reviews**,
* low **duration**, 
* low **director_facebook_likes**,
* low  **actor_3_facebook_likes**,
* low **actor_1_facebook_likes**,
* low **gross**, 
* low **num_voted_users**, 
* low **cast_total_facebook_likes**,
* high **facenumber_in_poster**, 
* low **num_user_for_reviews**,
* low **budget**, 
* low **actor_2_facebook_likes**,
* low **imdb_score**,
* low **aspect_ratio**, 
* low **movie_facebook_likes**.


4. cluster 3: 
* high **num_critic_for_reviews**,
* relatively high **duration**, 
* relatively high **director_facebook_likes**,
* high  **actor_3_facebook_likes**,
* relatively high **actor_1_facebook_likes**,
* high **gross**, 
* high **num_voted_users**, 
* relatively high **cast_total_facebook_likes**,
* relatively high **facenumber_in_poster**,
* high **num_user_for_reviews**,
* high **budget**, 
* high **actor_2_facebook_likes**,
* relatively high **imdb_score**,
* high **aspect_ratio**, 
* high **movie_facebook_likes**.
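
The high and low labels above were read off the table of cluster means. A rough way to produce such labels programmatically is to compare each cluster mean with the overall mean. The sketch below does this on a made-up `demo` frame; the threshold (above or below the overall mean) is an assumption chosen for illustration and is not necessarily the rule used for the profiles above.

```python
import numpy as np
import pandas as pd

# Made-up data: three clusters, two features.
demo = pd.DataFrame({
    "cluster":    [0, 0, 1, 1, 2, 2],
    "gross":      [10, 20, 200, 220, 60, 80],
    "imdb_score": [5.5, 6.0, 8.0, 8.2, 6.5, 6.9],
})

cluster_means = demo.groupby("cluster").mean()
overall_mean = demo.drop(columns="cluster").mean()

# Ratio above 1 means the cluster sits above the overall mean for that feature.
ratio = cluster_means / overall_mean
labels = pd.DataFrame(
    np.where(ratio > 1, "high", "low"),
    index=ratio.index,
    columns=ratio.columns,
)
print(labels)
```

In the notebook, `df5` would take the place of `demo`.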

## Part V. Storytelling

### At the end, this is what your client is interested in. Develop useful insights from your correlation analysis and machine learning models (regression, classification, and clustering). Write a summary using bulleted lists and/or numbers in markdown cells. If this section is “too thin”, your project will receive a low grade.

* imdb_score has a high positive correlation with num_voted_users, duration, num_critic_for_reviews, and num_user_for_reviews. Movies with more votes, longer running times, more critic reviews, and more user reviews tend to have higher scores.
* imdb_score has a negative correlation with facenumber_in_poster and color_Color, which means that _movies with fewer faces in the poster tend to have a higher imdb_score._
* Higher cast_total_facebook_likes goes with a higher imdb_score.
* Higher director_facebook_likes goes with a higher imdb_score.
* Higher actor_3_facebook_likes goes with a higher imdb_score.
* Higher actor_1_facebook_likes goes with a higher imdb_score.
* Higher actor_2_facebook_likes goes with a higher imdb_score.
* Higher budget goes with a higher imdb_score.
* Our best **regression model** indicates that **duration and facenumber_in_poster** play a very important role in the imdb_score. As duration increases, imdb_score increases, and as facenumber_in_poster increases, imdb_score decreases.
* After building, validating, and evaluating multiple **classification models**, we can see that num_critic_for_reviews, director_facebook_likes, actor_3_facebook_likes, actor_1_facebook_likes, gross, num_voted_users, cast_total_facebook_likes, num_user_for_reviews, actor_2_facebook_likes, aspect_ratio, and movie_facebook_likes are important variables in predicting whether a movie is good or bad. The best of these models, **random forest**, predicts whether a movie is good or bad with about **76% accuracy** on the test set.
* We use the K-means algorithm to build a **clustering model** and divide the whole dataset into 4 clusters. Cluster 0 has 1909 observations, cluster 1 has 162, cluster 2 has 1100, and cluster 3 has 553. In cluster 0, all 15 features (num_critic_for_reviews, duration, director_facebook_likes, actor_3_facebook_likes, actor_1_facebook_likes, gross, num_voted_users, cast_total_facebook_likes, facenumber_in_poster, num_user_for_reviews, budget, actor_2_facebook_likes, imdb_score, aspect_ratio, movie_facebook_likes) are relatively low. In cluster 2, all features except facenumber_in_poster are low. Cluster 1 has the highest imdb_score, and its features are either high or relatively high, except facenumber_in_poster, which is the lowest. Cluster 3 has the highest gross, and its features are either high or relatively high.
* From the data, we can say that a good movie does not need many faces in its poster.

