# Below is how I approach the Movie project
### 1. Frame the Problem and Look at the Big Picture
1. Define the objective in business terms.
    - There are two objectives, namely:
        1. Knowing the projected revenue.
        2. Determining customer interests.
2. How will your solution be used?
    - By companies trying to understand their profits.
    - By movie lovers looking for movie recommendations.
3. What are the current solutions/workarounds (if any)?
    - Data can be obtained from TMDB and Rotten Tomatoes (currently using Kaggle data).
    - For data analysis and machine learning, assume no existing solutions.
4. How should you frame this problem (supervised/unsupervised, online/offline, etc.)?
    - Supervised and batch learning.
        NB: online learning might be possible. Will experiment at end of project.
5. How should performance be measured?
    - For revenue predictions, RMSE will be used.
    - For the recommender system, F1-score will be used.
        NB: For both models, I will experiment with other performance measures.
6. Is the performance measure aligned with the business objective?
    - For the current business needs, yes.
    - However, in a real-world scenario I believe it would not be.
7. What would be the minimum performance needed to reach the business objective?
    - For revenue, a score of 94% is acceptable.
    - For the recommender system, a high recall score is preferable.
8. What are comparable problems? Can you reuse experience or tools?
    - Assume none exist.
9. Is human expertise available?
    - Assume no.
10. How would you solve the problem manually?
    - For revenue, by talking to people in the movie industry.
    - For the recommender, by asking random people.
11. List the assumptions you (or others) have made so far.
    - None identified at this point. 
12. Verify assumptions if possible.
    - None to verify. 
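
The two measures named in step 5 can be written out directly. Below is a minimal sketch in plain Python; the helper names `rmse` and `f1` and all sample values are made up for illustration, and in practice `sklearn.metrics` provides `mean_squared_error` and `f1_score` for the same purpose.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def f1(y_true, y_pred):
    """F1-score for binary labels: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical revenues (in millions) and predictions, for illustration only
revenue_rmse = rmse([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])

# Hypothetical labels: 1 = the user liked the movie, 0 = the user did not
recommender_f1 = f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])

print(round(revenue_rmse, 3), round(recommender_f1, 3))
```

Note that RMSE is expressed in the units of the target (here, revenue), not as a percentage, while F1 always lies between 0 and 1.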

In [1]:
# Deleted all the previous code because of the data.
# However, decided on using the same data for now.
# Fetching my own data is a bit hard and would distract me from the main goal.
# Below is the Python package to use should I want to fetch my own data.
# https://github.com/celiao/tmdbsimple/blob/master/tmdbsimple/movies.py

In [2]:
# EDA libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# show full cell contents (no limit on displayed column width)
pd.set_option("display.max_colwidth", None)

In [7]:
# load data
movies = pd.read_csv("../../../Data/archive/movies_metadata.csv", low_memory=False)

In [8]:
# separate data into train and test sets.
# this is to avoid data snooping
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(movies, test_size=0.2, random_state=34)

In [9]:
# use train_set for EDA and ML
movies = train_set
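
As a sanity check on the split above, the sketch below uses a small made-up DataFrame (the name `toy` and its values are illustrative, not the movie data) to confirm that a fixed `random_state` gives the same partition on every run and that the train and test rows do not overlap.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the movies data: 10 rows, for illustration only
toy = pd.DataFrame({"id": range(10), "revenue": range(0, 100, 10)})

# Same arguments as the real split: 20% test, fixed seed
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=34)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=34)

# The seed makes the split reproducible across runs
same_split = train_a.index.equals(train_b.index)

# No row appears in both sets, so the test set stays unseen during EDA
no_overlap = set(train_a.index).isdisjoint(test_a.index)

print(len(train_a), len(test_a), same_split, no_overlap)
```

If the training frame is later modified in place, taking an explicit copy with `train_set.copy()` is one way to keep those changes from being tied to the original frame.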

In [10]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 36372 entries, 36393 to 11681
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  36372 non-null  object 
 1   belongs_to_collection  3600 non-null   object 
 2   budget                 36372 non-null  object 
 3   genres                 36372 non-null  object 
 4   homepage               6239 non-null   object 
 5   id                     36372 non-null  object 
 6   imdb_id                36357 non-null  object 
 7   original_language      36361 non-null  object 
 8   original_title         36372 non-null  object 
 9   overview               35620 non-null  object 
 10  popularity             36368 non-null  object 
 11  poster_path            36054 non-null  object 
 12  production_companies   36370 non-null  object 
 13  production_countries   36370 non-null  object 
 14  release_date           36300 non-null  object 
 15  rev

In [24]:
# function to return the Python type of a single value (applied per cell to count types per column)
def count_dtypes(data):
    return type(data)

In [40]:
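# function to retrieve columns whose values have more than one Python type
def multi_dtypes_columns(data):
    mixed = []  # names of columns holding values of more than one type
    for var in data.columns:
        # count how many values of each Python type the column contains
        type_count = data[var].apply(count_dtypes).value_counts()
        if len(type_count) > 1:
            mixed.append(var)
    return np.array(mixed)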
cols = multi_dtypes_columns(movies)
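
To see why an `object` column can hold more than one Python type, here is a small sketch on a made-up DataFrame (the name `toy` and its values are illustrative only). It applies the same idea as the function above, mapping every cell to its type and counting, but keeps the per-type counts so the cause is visible. A common cause is a missing value: pandas stores `NaN` as a `float`, even in a column of strings.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
toy = pd.DataFrame({
    "title": ["Heat", np.nan, "Jumanji"],   # str plus a missing value (float NaN)
    "popularity": [21.9, "17.0", 3.8],      # float plus a number stored as str
    "adult": ["False", "False", "True"],    # str only
})

# Map each cell to the name of its Python type, then count per column
type_counts = {
    col: toy[col].map(lambda v: type(v).__name__).value_counts().to_dict()
    for col in toy.columns
}

# Columns with more than one distinct type
mixed_cols = [col for col, counts in type_counts.items() if len(counts) > 1]

print(type_counts)
print(mixed_cols)
```

Running the same per-type count on the real columns would show whether the mix comes only from missing values or also from numbers stored as strings.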

In [42]:
diff_col_movies = movies[cols]
diff_col_movies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 36372 entries, 36393 to 11681
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   belongs_to_collection  3600 non-null   object
 1   homepage               6239 non-null   object
 2   imdb_id                36357 non-null  object
 3   original_language      36361 non-null  object
 4   overview               35620 non-null  object
 5   popularity             36368 non-null  object
 6   poster_path            36054 non-null  object
 7   production_companies   36370 non-null  object
 8   production_countries   36370 non-null  object
 9   release_date           36300 non-null  object
 10  spoken_languages       36367 non-null  object
 11  status                 36296 non-null  object
 12  tagline                16309 non-null  object
 13  title                  36367 non-null  object
 14  video                  36367 non-null  object
dtypes: object(15)
memory

In [None]:
# function 