# Machine Learning Checklist
1. Frame the problem and look at the big picture.
2. Get the data.
3. Explore the data to gain insights.
4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.
5. Explore many different models and shortlist the best ones.
6. Fine-tune your models and combine them into a great solution.
7. Present your solution.
8. Launch, monitor, and maintain your system.

## i). Frame the Problem and Look at the Big Picture
1. Define the objective in business terms.
    - The objective is to estimate the revenue.
2. How will your solution be used?
    - Gauge profitability of a movie.
3. What are the current solutions/workarounds (if any)?
    - No current solutions are available.
4. How should you frame this problem (supervised/unsupervised, online/offline, etc.)?
    - This is a supervised, offline problem.
5. How should performance be measured?
    - RMSE/MSE will be used to gauge performance.
6. Is the performance measure aligned with the business objective?
    - Yes.
7. What would be the minimum performance needed to reach the business objective?
    - A 90% performance is needed.
8. What are comparable problems? Can you reuse experience or tools?
    - Assume none. 
9. Is human expertise available?
    - No.
10. How would you solve the problem manually?
    - Monitoring revenue from sales made.
11. List the assumptions you (or others) have made so far.
    - The features predicting revenue are known beforehand.
12. Verify assumptions if possible.
    - This assumption does not typically hold, since the features and the revenue generally become known concurrently.
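
The RMSE named in item 5 is the square root of the mean squared error, so it is expressed in the same units as revenue. Below is a minimal sketch with made-up numbers; the `rmse` helper and the sample values are illustrative and not taken from the dataset.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical revenues (in millions) and model predictions.
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
print(rmse(actual, predicted))
```

MSE is the same quantity without the final square root; RMSE is often easier to interpret because it shares the target's units.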

## ii). Get the Data
1. List the data you need and how much you need.
    - Here, I initially work with data from The Movie Database.
2. Find and document where you can get that data.
    - https://github.com/celiao/tmdbsimple/blob/master/tmdbsimple/movies.py
3. Check how much space it will take.
    - Less than 1GB.
4. Check legal obligations, and get authorization if necessary.
    - Free to use.
5. Get access authorizations.
    - Already have.
6. Create a workspace (with enough storage space).
    - I have one, and will later create a virtual environment.
7. Get the data.
    - Already have the data.
8. Convert the data to a format you can easily manipulate (without changing the data itself).
    - Will convert during data cleaning.
9. Ensure sensitive information is deleted or protected (e.g., anonymized).
    - No sensitive information.
10. Check the size and type of data (time series, sample, geographical, etc.).
    - Approximately 45k instances representing various data types.
11. Sample a test set, put it aside, and never look at it (no data snooping!).
    - Done :)

In [1]:
# EDA libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# show full column contents (no truncation of wide cells).
pd.set_option("display.max_colwidth", None)

In [3]:
# load data.
movies = pd.read_csv("../../../Data/archive/movies_metadata.csv", low_memory=False)

In [4]:
# separate data into train and test.
# this is to avoid data snooping.

from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(movies, test_size=0.2, random_state=44)

In [5]:
movies = train_set

## iii). Explore the Data
1. Create a copy of the data for exploration (sampling it down to a manageable size if necessary).
    - Done.
2. Create a Jupyter notebook to keep a record of your data exploration.
    - Done.
3. Study each attribute and its characteristics:
    - Name
    - Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
    - % of missing values
    - Noisiness and type of noise (stochastic, outliers, rounding errors, etc.)
    - Usefulness for the task
    - Type of distribution (Gaussian, uniform, logarithmic, etc.)
    - **[The above has been done, but only partially, due to lack of skill.]**
4. For supervised learning tasks, identify the target attribute(s).
    - Identified.
5. Visualize the data.
6. Study the correlations between attributes.
7. Study how you would solve the problem manually.
8. Identify the promising transformations you may want to apply.
9. Identify extra data that would be useful (go back to “Get the Data”).
10. Document what you have learned.

In [11]:
explo_data = movies.copy()

In [13]:
explo_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 36372 entries, 9522 to 14100
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  36372 non-null  object 
 1   belongs_to_collection  3626 non-null   object 
 2   budget                 36372 non-null  object 
 3   genres                 36372 non-null  object 
 4   homepage               6245 non-null   object 
 5   id                     36372 non-null  object 
 6   imdb_id                36358 non-null  object 
 7   original_language      36362 non-null  object 
 8   original_title         36372 non-null  object 
 9   overview               35594 non-null  object 
 10  popularity             36367 non-null  object 
 11  poster_path            36069 non-null  object 
 12  production_companies   36369 non-null  object 
 13  production_countries   36369 non-null  object 
 14  release_date           36297 non-null  object 
 15  reve

In [15]:
explo_data.describe(include='O')

| | adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | popularity | poster_path | production_companies | production_countries | release_date | spoken_languages | status | tagline | title | video |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 36372 | 3626 | 36372 | 36372 | 6245 | 36372 | 36358 | 36362 | 36372 | 35594 | 36367.0 | 36069 | 36369 | 36369 | 36297 | 36366 | 36301 | 16402 | 36366 | 36366 |
| unique | 5 | 1592 | 1073 | 3531 | 6171 | 36354 | 36338 | 88 | 34964 | 35443 | 35120.0 | 36037 | 18600 | 2052 | 15478 | 1636 | 6 | 16300 | 34228 | 2 |
| top | False | `{'id': 421566, 'name': 'Totò Collection', 'poster_path': '/4ayJsjC3djGwU9eCWUokdBWvdLC.jpg', 'backdrop_path': '/jaUuprubvAxXLAY5hUfrNjxccUh.jpg'}` | 0 | `[{'id': 18, 'name': 'Drama'}]` | http://www.georgecarlin.com | 141971 | tt1180333 | en | Cinderella | No overview found. | 0.0 | /2kslZXOaW0HmnGuVPCnQlCdXFR9.jpg | `[]` | `[{'iso_3166_1': 'US', 'name': 'United States of America'}]` | 2007-01-01 | `[{'iso_639_1': 'en', 'name': 'English'}]` | Released | Based on a true story. | Cinderella | False |
| freq | 36363 | 24 | 29286 | 3986 | 10 | 3 | 3 | 25806 | 7 | 102 | 55.0 | 4 | 9470 | 14223 | 102 | 17920 | 35998 | 7 | 10 | 36297 |


In [17]:
explo_data.describe()

| | revenue | runtime | vote_average | vote_count |
|---|---|---|---|---|
| count | 36366.0 | 36156.0 | 36366.0 | 36366.0 |
| mean | 11159060.0 | 94.190452 | 5.618737 | 109.285926 |
| std | 64135620.0 | 37.946414 | 1.922103 | 485.863405 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 85.0 | 5.0 | 3.0 |
| 50% | 0.0 | 95.0 | 6.0 | 10.0 |
| 75% | 0.0 | 107.0 | 6.8 | 34.0 |
| max | 2787965000.0 | 1256.0 | 10.0 | 14075.0 |
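
The quartiles above show that revenue is 0 up to the 75th percentile, so at least three quarters of the rows report no revenue. One way to quantify this is sketched below on a toy frame; the values are made up, and in the notebook `explo_data` would take the place of `toy`.

```python
import pandas as pd

# Toy stand-in for explo_data; values are illustrative only.
toy = pd.DataFrame({"revenue": [0.0, 0.0, 0.0, 1_500_000.0, None]})

# Share of rows whose revenue is exactly zero (NaN compares as False).
zero_share = (toy["revenue"] == 0).mean()

# Summary restricted to rows with a reported, non-zero revenue.
nonzero = toy.loc[toy["revenue"] > 0, "revenue"]
print(f"{zero_share:.0%} zero; non-zero median = {nonzero.median():,.0f}")
```

Whether a zero means "no revenue" or "revenue not recorded" is not something this check can tell; it only measures how common the value is.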


In [19]:
# will study each attribute.
# I'll focus on the following areas.
# 1. Type of data.
# 2. % of missing data.
# 3. Type of distribution. 

In [21]:
explo_data.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

# These are the observations per attribute.
    1. adult:
        a). has three values that appear corrupt.
        b). appears to be of type boolean.
    2. belongs_to_collection:
        a). data type is a string object which appears to be a dictionary.
        b). only 10% is non-null.
    3. budget:
        a). should be numeric, but some values are plain strings.
    4. genres:
        a). data type is a string object which is a list of dictionary objects.
    5. homepage:
        a). values appear to be web links.
        b). only 17% is non-null.
    6. id:
        a). 18 values aren't unique.
    7. imdb_id:
        a). 34 values aren't unique.
    8. original_language:
        a). 10 values are null.
        b). some values are numeric whereas the rest are strings.
    9. original_title:
        a). data type is string.
    10. overview:
        a). 2% is null.
        b). data type is string.
    11. popularity:
        a). 0.01% is null.
        b). data type is numeric, but some values are plain strings.
    12. poster_path:
        a). 0.8% is null.
        b). values appear to be file paths to JPG images.
    13. production_companies:
        a). 0.008% is null.
        b). data type is a string of a list of dictionary objects.
    14. production_countries:
        a). 0.008% is null.
        b). data type is a string of a list of dictionary objects.
    15. release_date:
        a). 0.2% is null.
        b). data type is datetime (stored as strings).
    16. revenue:
        a). 0.016% is null.
        b). data type is numeric.
    17. runtime:
        a). 0.6% is null.
        b). data type is numeric.
    18. spoken_languages:
        a). 0.016% is null.
        b). data type is a string of a list of dictionary objects.
    19. status:
        a). 0.195% is null.
        b). data type is categorical.
    20. tagline:
        a). 55% is null.
        b). data type is string.
    21. title:
        a). 0.016% is null.
        b). data type is string.
    22. video:
        a). 0.016% is null.
        b). data type is boolean.
    23. vote_average:
        a). 0.016% is null.
        b). data type is numeric.
    24. vote_count:
        a). 0.016% is null.
        b). data type is numeric.
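
Counts such as the 18 non-unique `id` values in item 6 follow from comparing the number of rows with the number of distinct values (36372 rows against 36354 unique ids in the `describe` output). A sketch on made-up data:

```python
import pandas as pd

# Toy ids; "2" appears twice and "3" three times.
toy = pd.DataFrame({"id": ["1", "2", "2", "3", "3", "3"]})

# Rows minus distinct values: how many rows repeat an earlier id.
n_repeats = len(toy) - toy["id"].nunique()

# Same count via duplicated(), which flags every occurrence after the first.
assert n_repeats == toy["id"].duplicated().sum()

# keep=False marks all members of a duplicated group, useful for inspection.
print(toy[toy["id"].duplicated(keep=False)])
```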

### Important notes after Data Exploration.
1). There are missing values in the following columns:
- belongs_to_collection (90%)
- homepage (83%)
- tagline (55%)
- overview (2%)
- poster_path (0.8%)
- runtime (0.6%)
- release_date (0.2%)
- status (0.195%)
- revenue (0.016%)
- spoken_languages (0.016%)
- title (0.016%)
- video (0.016%)
- vote_average (0.016%)
- vote_count (0.016%)
- popularity (0.01%)
- production_companies (0.008%)
- production_countries (0.008%)

2). It's hard to visualize the data without first cleaning it.
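
The missing-value percentages listed in 1) can be computed in one pass rather than column by column. The sketch below uses a toy frame with invented values; in the notebook `explo_data` would be used in place of `toy`.

```python
import pandas as pd

# Toy frame with gaps in two of its three columns.
toy = pd.DataFrame({
    "tagline": ["a", None, None, "b"],
    "runtime": [90.0, 100.0, None, 95.0],
    "title": ["w", "x", "y", "z"],
})

# isna() gives a boolean frame; its column means are the missing fractions.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)

# Keep only the columns that actually have gaps.
print(missing_pct[missing_pct > 0].round(3))
```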

In [23]:
# below is the data cleaning process for each attribute.
# will use classes so as to facilitate use of pipelines.
# three classes will be created.
# one for numeric columns cleaning.
# second for strings of list objects.
# third for string objects.

In [25]:
from sklearn.base import BaseEstimator, TransformerMixin

In [27]:
numeric_cols = ["budget", "popularity", "revenue", "runtime", "vote_average", "vote_count"]
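class NumericCleaning(BaseEstimator, TransformerMixin):
    """Coerce the numeric columns to floats; unparseable values become NaN."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()  # avoid mutating the caller's DataFrame
        for col in numeric_cols:
            # errors="coerce" turns non-numeric strings into NaN
            # instead of raising, and converts the column to a numeric dtype.
            X[col] = pd.to_numeric(X[col], errors="coerce")
        return X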

In [29]:
numeric_clean = NumericCleaning()
explo_data = numeric_clean.transform(explo_data)
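
The effect of the numeric cleaning step can be seen on a small example: entries that cannot be parsed as numbers end up null. `pd.to_numeric` with `errors="coerce"` is one vectorized way to get that behaviour, and it also converts the column to a float dtype. The sample values are invented.

```python
import pandas as pd

# A budget-like column mixing numeric strings, junk and a missing value.
raw = pd.Series(["30000000", "0", "not-a-number", None], name="budget")

# errors="coerce" converts unparseable entries to NaN instead of raising.
clean = pd.to_numeric(raw, errors="coerce")

print(clean.tolist())
print(clean.dtype)
```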

In [31]:
import ast
list_dict_cols = ["genres", "production_companies", "production_countries", "spoken_languages"]
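class ToListDictionary(BaseEstimator, TransformerMixin):
    """Parse stringified lists of dictionaries into Python objects."""

    def fit(self, X, y=None):
        return self

    def change(self, val):
        # literal_eval safely parses Python literals; anything unparseable
        # (NaN, malformed strings) falls back to an empty list.
        try:
            return ast.literal_eval(val)
        except (ValueError, SyntaxError, TypeError):
            return []

    def transform(self, X, y=None):
        X = X.copy()  # avoid mutating the caller's DataFrame
        for col in list_dict_cols:
            X[col + "_edit"] = X[col].apply(self.change)
        return X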

In [33]:
to_list = ToListDictionary()
explo_data = to_list.transform(explo_data)
explo_data.drop(list_dict_cols, axis=1, inplace=True)
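
Once the strings are parsed into lists of dictionaries, the `name` fields can be pulled out for later encoding. The sketch below uses the most frequent `genres` value from the summary earlier in the notebook; the `extract_names` helper is hypothetical and not part of the cleaning classes.

```python
import ast

def extract_names(raw):
    """Parse a stringified list of dicts and return the 'name' of each entry."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return []  # unparseable input (e.g. NaN) yields no names
    if not isinstance(parsed, list):
        return []  # guard against literals that are not lists
    return [d["name"] for d in parsed if isinstance(d, dict) and "name" in d]

print(extract_names("[{'id': 18, 'name': 'Drama'}]"))
print(extract_names("[]"))
print(extract_names(float("nan")))
```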

In [34]:
string_cols = ["adult", "original_language", "original_title", 
                "overview", "status", "tagline", "title", "video"]

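class StringCleaning(BaseEstimator, TransformerMixin):
    """Null out values in the string columns that are really numbers."""

    def fit(self, X, y=None):
        return self

    def is_numeric(self, val):
        # float(True) succeeds, so genuine booleans (which a column such as
        # `video` may contain) are excluded explicitly.
        if isinstance(val, bool):
            return False
        try:
            float(val)
            return True
        except (TypeError, ValueError):
            return False

    def transform(self, X, y=None):
        X = X.copy()  # avoid mutating the caller's DataFrame
        for col in string_cols:
            # mask() replaces flagged entries with NaN and keeps the rest.
            X[col] = X[col].mask(X[col].apply(self.is_numeric))
        return X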

In [None]:
str_clean = StringCleaning()
explo_data = str_clean.transform(explo_data)

In [None]:
explo_data.info()
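
Because the cleaning classes follow the scikit-learn transformer interface, they can be chained in a `Pipeline`. The sketch below wraps two small stand-in functions in `FunctionTransformer` so that it runs on its own; the function names and the toy rows are assumptions, and in the notebook the three classes defined above would take their place.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def coerce_budget(df):
    """Stand-in numeric step: non-numeric budgets become NaN."""
    out = df.copy()
    out["budget"] = pd.to_numeric(out["budget"], errors="coerce")
    return out

def strip_title(df):
    """Stand-in string step: trim surrounding whitespace in titles."""
    out = df.copy()
    out["title"] = out["title"].str.strip()
    return out

# Steps run in order; each receives the previous step's DataFrame.
cleaning = Pipeline([
    ("numeric", FunctionTransformer(coerce_budget)),
    ("strings", FunctionTransformer(strip_title)),
])

# Invented rows for illustration.
toy = pd.DataFrame({"budget": ["100", "oops"], "title": [" Cinderella ", "Heat"]})
cleaned = cleaning.fit_transform(toy)
print(cleaned)
```

Chaining the steps this way means the same cleaning can later be applied to the held-out test set with a single `transform` call.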
