## Data Collection

**Note:** The code discussed below is in `data_colection.py`, located in the script directory. Data collection is split into two parts: the First Request and the Second Request.

---
### First Request:

The code below was used for the First Request. Since the project's goal is to create an HBO Max recommender system, I collected shows and movies only from that platform and, to keep the project simple, only from the U.S. library. To start, I made a single pull using the `initial_data` function. The returned dictionary contains a key called `total_results`, which tells the while loop in `hbo_content_list` when to stop collecting the remaining pages. The collected data are then arranged into a dataframe by the `json_df_content` function and saved as df_1.

A problem observed with the scraper is that it throws an error whenever the status code is not 200, so the requests inside `hbo_content_list` are wrapped in a try/except statement. I also pickled the collected data into a .txt file after each request to prevent data loss, and added print statements that track the page count and the number of items collected, making it easy to resume the collection if the cycle breaks. Lastly, I paused for 5 seconds after each request to comply with politeness and rate-limiting policies; this interval was chosen by trial and error.

```
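import pickle
import time

import pandas as pd
from justwatch import JustWatch

# Client for the U.S. library (assumed setup; not shown in the original snippet).
just_watch = JustWatch(country='US')

def initial_data():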
    data = just_watch.search_for_item(providers=['hbm'], page=1)
    return [data], data['total_results']

def hbo_content_list():
    data_list, total_size = initial_data()
    size = len(data_list[0]['items']) 
    page_num = 2 
    while size < total_size:
        try: 
            data = just_watch.search_for_item(providers=['hbm'], page=page_num)
            data_list.append(data)
            size += len(data['items'])
            page_num += 1 
            if size % 30 == 0: 
                print("Number of data pulled:")
                print(size)
                print("Page Number:")
                print(page_num - 1)
            with open('initial_data.txt', 'wb') as output:
                pickle.dump(data_list, output)
            time.sleep(5)
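        except Exception as err:
            # Report the failure and wait before retrying the same page.
            print(f"Request for page {page_num} failed: {err}")
            time.sleep(5)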
    return data_list

def json_df_content():
    content = []
    data_list = hbo_content_list()
    for items in data_list:
        for item in items['items']:
            show = {}
            show['id'] = item['id']
            show['title'] = item['title']
            show['type'] = item['object_type']
            content.append(show)
    return pd.DataFrame(content)

df_1 = json_df_content()
df_1.to_csv('df_1.csv')
```
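The checkpoint file makes it possible to pick up where a broken run left off. The following is a minimal sketch of that idea, not the project's actual code: `load_checkpoint` and `resume_state` are hypothetical helper names, and the page dictionaries are made-up stand-ins for real API responses. It assumes one saved response per page, as in the loop above.

```python
import os
import pickle
import tempfile


def load_checkpoint(path):
    """Return the list of page responses saved so far, or an empty list."""
    if not os.path.exists(path):
        return []
    with open(path, "rb") as handle:
        return pickle.load(handle)


def resume_state(data_list):
    """Compute how many items were collected and which page to request next."""
    size = sum(len(page["items"]) for page in data_list)
    next_page = len(data_list) + 1  # one saved response per page
    return size, next_page


# Demonstration with made-up page responses (not real API output).
fake_pages = [
    {"total_results": 5, "items": [{"id": 1}, {"id": 2}]},
    {"total_results": 5, "items": [{"id": 3}, {"id": 4}]},
]
checkpoint_path = os.path.join(tempfile.mkdtemp(), "initial_data.txt")
with open(checkpoint_path, "wb") as handle:
    pickle.dump(fake_pages, handle)

restored = load_checkpoint(checkpoint_path)
size, next_page = resume_state(restored)
print(size, next_page)
```

With these values in hand, the while loop could start from `next_page` and the restored list instead of from page 2.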

---

### Second Request:

The code below performs the Second Request, which collects the granular details needed for the EDA and the modeling/recommender process. The first function, `add_info`, takes df_1 as input and passes each title's id and type to the JustWatch API's `get_title()` method, which returns the additional information as JSON. The `json_df_add` function then arranges these responses into a dataframe with the following columns: id, plot, MPAA/TV rating, genre, popularity score, IMDB rating, and TMDB rating. The same precautions as in the code block above apply here: each loop iteration is throttled by 5 seconds, the collected data are pickled to .txt files, and print statements track the collection progress. The second part's output dataframe is concatenated with df_1, creating a final dataframe of 10 columns and 1980 rows that is saved to a CSV file (hbo_data.csv).

```
df_1 = json_df_content()
df_1.to_csv('df_1.csv')

def add_info(df):
    content = []
    for i in df.index:
        show = just_watch.get_title(title_id = df.loc[i, 'id'], content_type= df.loc[i, 'type'])
        content.append(show)
        with open('raw_data.txt', 'wb') as output:
            pickle.dump(content, output)
        print("Number of data pulled:")
        print(len(content))
        time.sleep(5)
    return content
    
def json_df_add():
    content = []
    raw_data = add_info(df_1)
    for data in raw_data:
        show_info = {}
        show_info['year'] = data.get('original_release_year')
        show_info['plot'] = data.get('short_description')
        show_info['genre'] = data.get('genre_ids')
        show_info['rating'] = data.get('age_certification')
        if data.get('scoring') is None:
            show_info['avg_rating'] = None
        else:
            for score in data.get('scoring'):
                if score['provider_type'] == 'imdb:score':
                    show_info['avg_rating'] = score['value']
                elif score['provider_type'] == 'tmdb:score':
                    show_info['avg_rating'] = score['value']
                if score['provider_type'] == 'tmdb:popularity':
                    show_info['popularity_score'] = score['value']
        content.append(show_info)
    return pd.DataFrame(content)

df_2 = json_df_add()
df_2.to_csv('df_2.csv')
hbo_data = pd.concat([df_1, df_2], axis=1)
hbo_data.to_csv('hbo_data.csv') 

```
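Since the description lists the IMDB and TMDB ratings as separate columns, one possible way to pull them apart from the `scoring` list is sketched below. This is an illustration under assumptions rather than the project's method: `extract_scores` and the column names `imdb_rating`, `tmdb_rating`, and `popularity_score` are hypothetical, and the sample list only mimics the shape of the entries handled in the loop above.

```python
def extract_scores(scoring):
    """Map a scoring list to separate named columns, defaulting to None."""
    # Provider types taken from the loop above; column names are illustrative.
    wanted = {
        "imdb:score": "imdb_rating",
        "tmdb:score": "tmdb_rating",
        "tmdb:popularity": "popularity_score",
    }
    row = {column: None for column in wanted.values()}
    for score in scoring or []:  # treat a missing list as empty
        column = wanted.get(score.get("provider_type"))
        if column is not None:
            row[column] = score.get("value")
    return row


# Made-up sample shaped like the entries in a title's scoring list.
sample = [
    {"provider_type": "imdb:score", "value": 8.1},
    {"provider_type": "tmdb:popularity", "value": 42.5},
]
print(extract_scores(sample))
print(extract_scores(None))
```

Because every column starts as `None`, titles with no scoring data or only some providers still produce a complete row.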

---

## Data Cleaning 

Approximately 20% of the data is missing, in part because the API does not have the information. The missing values were therefore imputed manually using information from the IMDB website. Ideally, a process similar to the ones above would have been used. However, IMDB requires an API key to access its information, and given this project's timeline it was not feasible to wait for a response from them. Additionally, using their scraper would require redoing the entire process, since each show and movie would have a different id. The combined dataframe was then run through the `convert_genre()` and `fix_rating()` functions to change the genre ids to their corresponding terms and remove the trailing spaces from the MPAA/TV ratings.

**Note:** `convert_genre()` and `fix_rating()` are located inside `functions.py` in the script directory.
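The two cleaning steps can be pictured with the per-value sketch below. It is not the contents of `functions.py`: `genre_names`, `strip_rating`, and the `GENRE_LOOKUP` table (including its ids and names) are hypothetical and serve only to illustrate mapping ids to terms and trimming trailing spaces.

```python
# Hypothetical lookup table; the real ids and names would come from the API.
GENRE_LOOKUP = {1: "Action", 2: "Comedy", 3: "Drama"}


def genre_names(genre_ids, lookup):
    """Translate a list of genre ids into their names."""
    if not genre_ids:
        return []
    return [lookup.get(genre_id, "Unknown") for genre_id in genre_ids]


def strip_rating(rating):
    """Remove trailing whitespace from an MPAA/TV rating, keeping None as is."""
    if rating is None:
        return None
    return rating.rstrip()


print(genre_names([1, 3], GENRE_LOOKUP))
print(repr(strip_rating("PG-13 ")))
```

On a dataframe, helpers like these could be applied column by column, for example with `df['rating'].apply(strip_rating)`.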

```
import script.functions as func
from justwatch import JustWatch
import pandas as pd
import pickle
import autoreload
%load_ext autoreload
%autoreload 2

df = pd.read_csv('../data/hbo_data.csv')
df = func.convert_genre(df)
df = func.fix_rating(df)
df.to_csv('../data/final_hbo_data.csv')
```