# ML Assignment 2 - TMDB Box Office Prediction

What we are going to predict is the revenue for each of the ids ("try and predict their overall worldwide box office revenue").

During this project we will use the eight steps in Appendix B:
1. Look at the big picture.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Present your solution.
8. Launch, monitor, and maintain your system.

# Hello and welcome

#### Get the data:

In [78]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import sklearn

In [79]:
# List the files in the data folder
import os
print(os.listdir("data"))

['test.csv', 'train.csv', 'sample_submission.csv']


In [80]:
# Read the CSV files into pandas DataFrames

train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
sampleSubmission = pd.read_csv('data/sample_submission.csv')

## Explore the data

In [None]:
train.head()

In [None]:
train.info()

`info()` gives a clean list of all the features in the train set. Most of the features are objects; a few are ints or floats. It also reports the non-null count for each column. Some features have a lower non-null count than the others, which means they have missing data. The column "belongs_to_collection" has a very low non-null count, so it probably has a lot of missing data. Let's go deeper:

The code below gives an overview of the missing values in train. Missing values make it harder to give good predictions if they are not handled correctly.

In [None]:
print(train.isnull().sum())
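
Raw counts can be hard to compare, so it may help to look at the share of missing values per column instead. The following is a minimal sketch on a small made-up frame (`toy` and its values are assumptions, not the real `train` data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; the real columns and values differ.
toy = pd.DataFrame({
    "budget": [100, 200, 300, 400],
    "homepage": [None, None, None, "http://example.com"],
    "runtime": [90.0, np.nan, 120.0, 100.0],
})

# isnull() gives a boolean frame; the column mean is the fraction of missing rows.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data the same two calls could be applied directly to `train`.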

Homepage is one of the features with a lot of missing values. It is not considered an important feature for predicting revenue, so it will be dropped.

In this first round we drop all the features that have missing values. Later we will bring back the necessary ones and impute their missing values.

In [None]:
X = train.drop(['id', 'belongs_to_collection', 'genres', 'homepage',
            'overview', 'poster_path','production_companies',
            'production_countries','runtime', 'spoken_languages',
            'tagline','Keywords','cast', 'crew'],axis=1)


#train.drop(['homepage','imdb_id','belongs_to_collection',
 #           'genres','overview','production_companies',
  #          'production_countries','poster_path','spoken_languages',
   #         'tagline','Keywords','crew','cast'],axis=1)
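
Instead of listing the columns by hand, the columns with missing values could also be found programmatically. This is only a sketch on a made-up frame (`toy`, `cols_with_missing` and `X_toy` are hypothetical names), not the selection used above:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "budget": [100, 200, 300],
    "tagline": ["a", None, "c"],
    "runtime": [90.0, np.nan, 120.0],
})

# Names of every column that contains at least one missing value.
cols_with_missing = toy.columns[toy.isnull().any()].tolist()

# Drop those columns together with the identifier column.
X_toy = toy.drop(columns=cols_with_missing + ["id"])
print(cols_with_missing)
print(X_toy.columns.tolist())
```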

In [None]:
X.head()

In [None]:
#for i, e in enumerate(train['belongs_to_collection'][:5]):
  #  print(i, e)

In [None]:
X_test = test.drop("id", axis=1)
X_test.head()

In [None]:
#train['belongs_to_collection'].apply(lambda x: len(x) if x != {} else 0).value_counts()

### Let's also understand the popularity of the genres

In [None]:
#for i, e in enumerate(train['genres'][:5]):
   # print(i, e)

In [None]:
#print('Number of genres in films')
#train['genres'].apply(lambda x: len(x) if x != {} else 0).value_counts()

In [None]:
#list_of_genres = list(train['genres'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)

In [None]:
#plt.figure(figsize = (12, 8))
#text = ' '.join([i for j in list_of_genres for i in j])
#wordcloud = WordCloud(max_font_size=None, background_color='white', collocations=False,
                      #width=1200, height=1000).generate(text)
#plt.imshow(wordcloud)
#plt.title('Top genres')
#plt.axis("off")
#plt.show()
# Creates a kind of mind map in which the most popular genres are highlighted.
# Will not run because list_of_genres from the cell above is missing.
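
One way to build `list_of_genres` is sketched below. It assumes that each `genres` cell is a string holding a Python-style list of dicts (for example `"[{'id': 35, 'name': 'Comedy'}]"`) and that missing cells are NaN/None; both are assumptions about the CSV, and `parse_genre_names` is a hypothetical helper:

```python
import ast
from collections import Counter

import pandas as pd

# Toy column standing in for train['genres'] (assumed string format).
toy = pd.DataFrame({
    "genres": [
        "[{'id': 35, 'name': 'Comedy'}]",
        "[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name': 'Comedy'}]",
        None,
    ]
})

def parse_genre_names(raw):
    """Return the genre names in one cell, or [] when the cell is missing."""
    if pd.isna(raw):
        return []
    # literal_eval safely turns the string into a list of dicts.
    return [entry["name"] for entry in ast.literal_eval(raw)]

list_of_genres = toy["genres"].apply(parse_genre_names).tolist()

# Count how often each genre occurs across all films.
genre_counts = Counter(name for names in list_of_genres for name in names)
print(genre_counts.most_common())
```

A plain count like this gives the same ranking information as the word cloud without needing an extra library.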

In [None]:
X.describe()

If the standard deviation had been 1 and the mean had been 0 for every feature, the data would already have been standardized.
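
A mean of 0 and a standard deviation of 1 can be obtained with scikit-learn's `StandardScaler`. The sketch below uses made-up numbers (`values` is an assumption) and is not part of the pipeline in this notebook:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales.
values = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

# Subtract the column mean and divide by the column standard deviation.
scaler = StandardScaler()
scaled = scaler.fit_transform(values)

print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```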

### Test set

To split the data into a train set and a test set we use the train_test_split function, which randomly splits the data set into two parts (here 80% and 20%, with a fixed random_state so the split is reproducible).

In [None]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(train, test_size=0.2, random_state=42)
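
The split above keeps the label inside both parts. If features and label are wanted separately, `train_test_split` can split both at once. A minimal sketch on a made-up frame (`toy`, `features`, `target` and the `X_tr`/`X_val` names are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for `train`, with `revenue` as the label.
toy = pd.DataFrame({
    "budget": range(10),
    "popularity": [0.5 * i for i in range(10)],
    "revenue": [10 * i for i in range(10)],
})

features = toy.drop(columns=["revenue"])
target = toy["revenue"]

# Both inputs are split with the same random row selection.
X_tr, X_val, y_tr, y_val = train_test_split(
    features, target, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_val))
```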

In [None]:
test.head()

In [None]:
test.info()

In [None]:
test.describe()

Observation: many NaN

In [None]:
len(test)

# Handling text

Most of the provided features are not numeric, so the object columns have to be converted to numbers before a model can use them.

In [None]:
train["original_language"].value_counts()

In [None]:
original_language_cat = train[["original_language"]]
original_language_cat.head(10)

Numeric operations such as computing a median cannot be applied to text, so original_language has to be converted from text to numbers. In the following code, OrdinalEncoder maps each language to an integer code, and OneHotEncoder then turns those codes into a one-hot matrix with one column per language.

In [None]:
from sklearn.preprocessing import OneHotEncoder 
from sklearn.preprocessing import OrdinalEncoder

In [None]:
ordinal_encoder = OrdinalEncoder()
original_language_cat_encoded = ordinal_encoder.fit_transform(original_language_cat)
original_language_cat_encoded[:10]

In [None]:
cat_encoder = OneHotEncoder()
original_language_cat_hot = cat_encoder.fit_transform(original_language_cat_encoded)
original_language_cat_hot
original_language_cat_hot.toarray()

In [None]:
ordinal_encoder.categories_
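
`OneHotEncoder` also accepts string columns directly, so the intermediate `OrdinalEncoder` step is optional. A small sketch with made-up language codes (`toy` is an assumption; `handle_unknown="ignore"` is added here so that languages not seen during fitting are encoded as all zeros):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy column standing in for train[["original_language"]].
toy = pd.DataFrame({"original_language": ["en", "fr", "en", "ja"]})

# Fit on the raw strings; the result is a sparse matrix, one column per language.
encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(toy[["original_language"]])

print(encoder.categories_)
print(one_hot.toarray())
```

With this approach `encoder.categories_` holds the language strings themselves rather than their integer codes.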

#### Original title

In [None]:
train["original_title"].value_counts()

In [None]:
original_title_cat = train[["original_title"]]
original_title_cat.head(10)

In [None]:
original_title_cat_encoded = ordinal_encoder.fit_transform(original_title_cat)
original_title_cat_encoded[:10]

In [None]:
original_title_cat_hot = cat_encoder.fit_transform(original_title_cat_encoded)
original_title_cat_hot
original_title_cat_hot.toarray()

In [None]:
ordinal_encoder.categories_

####  Status

In [None]:
train["status"].value_counts()

In [None]:
status_cat = train[["status"]]
status_cat.head(10)

In [None]:
status_cat_encoded = ordinal_encoder.fit_transform(status_cat)
status_cat_encoded[:10]

In [None]:
status_cat_hot = cat_encoder.fit_transform(status_cat_encoded)
status_cat_hot
status_cat_hot.toarray()

In [None]:
ordinal_encoder.categories_

### Pipeline

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

In [None]:
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
])

In [None]:
original_language_cat_attribs = ["original_language"]
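# Keep only numeric columns for the median imputer; revenue is the label and is not in test
num_attribs_language = [f for f in X.select_dtypes(include="number").columns
                        if f not in original_language_cat_attribs and f != "revenue"]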

In [None]:
full_pipeline = ColumnTransformer([
                      ("num", num_pipeline, num_attribs_language),
                      # Pass the list of column names, not the DataFrame slice
                      ("cat", OrdinalEncoder(), original_language_cat_attribs),
  ])

X_prepared = full_pipeline.fit_transform(X)
test_prepared = full_pipeline.transform(X_test)
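
The whole preprocessing step can be tried out on a small made-up example. The sketch below is an illustration under assumptions, not the final pipeline: the frames `toy_train` and `toy_test`, the column lists and the name `preprocess` are hypothetical, and the `handle_unknown="use_encoded_value"` option (available in scikit-learn 0.24 and later) is added so that a language that only occurs in the test set does not raise an error:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

toy_train = pd.DataFrame({
    "budget": [100.0, np.nan, 300.0],
    "popularity": [1.0, 2.0, 3.0],
    "original_language": ["en", "fr", "en"],
})
toy_test = pd.DataFrame({
    "budget": [np.nan],
    "popularity": [4.0],
    "original_language": ["de"],  # language not seen during fit
})

num_cols = ["budget", "popularity"]
cat_cols = ["original_language"]

preprocess = ColumnTransformer([
    # Fill missing numeric values with the median learned from the train data.
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median"))]), num_cols),
    # Map each language to an integer; unseen languages become -1.
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), cat_cols),
])

train_prepared = preprocess.fit_transform(toy_train)
test_prepared_toy = preprocess.transform(toy_test)
print(train_prepared)
print(test_prepared_toy)
```

Fitting only on the train frame and calling `transform` on the test frame keeps the imputed median and the language codes consistent between the two.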