### GOAL
* Convert separate CSV files into JSON using pandas so that the data can be easily imported into MongoDB Atlas
* Merge the separate CSV files on `movieId`
* Remove unnecessary columns
* Ensure the data in every column has the right type and format

In [4]:
import pandas as pd

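* Reading the movies, links, tags, and ratings CSV files into dataframes
* Reading every column of links as a string, which keeps the IDs exactly as written (including any leading zeros)
* Converting movieId in links back to int so that it matches movieId in the other dataframes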

In [100]:
movies = pd.read_csv('movies.csv')
links = pd.read_csv('links.csv', dtype=str)
links['movieId'] = links['movieId'].astype(int)
ratings = pd.read_csv('ratings.csv')
tags = pd.read_csv('tags.csv')
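
A small sketch of why reading links as strings matters. The two-row sample below is hypothetical and only imitates the column layout of links.csv; it shows that numeric parsing drops leading zeros, while `dtype=str` keeps the IDs intact for building URLs later.

```python
import io
import pandas as pd

# Hypothetical sample with the same columns as links.csv (illustration only)
sample_csv = "movieId,imdbId,tmdbId\n1,0114709,862\n2,0113497,8844\n"

# Default parsing infers integers, so the leading zeros are lost
as_numbers = pd.read_csv(io.StringIO(sample_csv))

# dtype=str keeps every value exactly as it appears in the file
as_strings = pd.read_csv(io.StringIO(sample_csv), dtype=str)

print(as_numbers['imdbId'].tolist())
print(as_strings['imdbId'].tolist())
```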


* Creating a new dataframe, dataset, by merging all of the dataframes on movieId
* Each merge uses the default inner join, so only movies that appear in every file are kept

In [101]:
dataset = pd.merge(movies, tags, on='movieId')
dataset = pd.merge(dataset, links, on='movieId')
dataset = pd.merge(dataset, ratings, on='movieId')
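
The chained merges pair every tag row with every rating row of the same movie, so the intermediate dataframe can grow much larger than any input. The sketch below uses hypothetical toy dataframes (`movies_s`, `tags_s`, `ratings_s`) to show that effect, followed by one possible alternative that aggregates tags and ratings before merging. Note that the alternative uses `how='left'`, which is an assumption that differs from the notebook: it keeps movies that have no tags.

```python
import pandas as pd

# Hypothetical toy data (illustration only)
movies_s = pd.DataFrame({'movieId': [1, 2], 'title': ['A', 'B']})
tags_s = pd.DataFrame({'movieId': [1, 1], 'tag': ['fun', 'pixar']})
ratings_s = pd.DataFrame({'movieId': [1, 1, 1, 2],
                          'rating': [4.0, 5.0, 3.0, 2.0]})

# Same merge pattern as the notebook: inner joins on movieId
merged = pd.merge(movies_s, tags_s, on='movieId')
merged = pd.merge(merged, ratings_s, on='movieId')
# Movie 1 contributes 2 tags x 3 ratings rows; movie 2 has no tags and is dropped
print(len(merged))

# Alternative sketch: reduce tags and ratings to one row per movie first
tag_lists = tags_s.groupby('movieId')['tag'].agg(list).reset_index()
mean_ratings = ratings_s.groupby('movieId')['rating'].mean().reset_index()
compact = (movies_s
           .merge(mean_ratings, on='movieId', how='left')
           .merge(tag_lists, on='movieId', how='left'))
print(compact)
```

Because every rating is repeated once per tag, the average rating computed later from the merged rows is the same as the one computed from the ratings alone; the difference is in memory use and in which movies survive the join.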

* Dropping the userId and timestamp columns, which the merges suffixed with _x (from tags) and _y (from ratings)

In [102]:
dataset = dataset.drop(columns=['userId_x','timestamp_x', 'userId_y', 'timestamp_y'])

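* Grouping by movieId so that each movie becomes a single row
* Keeping the first title, genres, imdbId, and tmdbId in each group, since they are identical within a movie
* Calculating the average rating
* Combining all separate tags into a list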

In [103]:
dataset = dataset.groupby('movieId', as_index=False).agg({
    'title': 'first',
    'genres': 'first',
    'imdbId': 'first',
    'tmdbId': 'first',
    'rating': 'mean',
    'tag': lambda x: list(x),
})
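
To make the aggregation concrete, here is a minimal sketch on a hypothetical toy dataframe (`toy`) shaped like the merged data, with one row per tag and rating combination. It applies the same kinds of aggregations: `'first'` for constant columns, `'mean'` for the rating, and a list for the tags.

```python
import pandas as pd

# Hypothetical merged rows: movie 1 has 2 tags x 2 ratings, movie 2 has one row
toy = pd.DataFrame({
    'movieId': [1, 1, 1, 1, 2],
    'title': ['A', 'A', 'A', 'A', 'B'],
    'rating': [4.0, 5.0, 4.0, 5.0, 3.0],
    'tag': ['fun', 'fun', 'pixar', 'pixar', 'dark'],
})

grouped = toy.groupby('movieId', as_index=False).agg({
    'title': 'first',           # identical within a group, so any row works
    'rating': 'mean',           # average over all rows of the movie
    'tag': lambda x: list(x),   # collect tags, duplicates included
})
print(grouped)
```

The tag list for movie 1 still contains repeats at this point, which is why a later step removes duplicates.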

* Rounding the rating to the nearest tenth
* Turning the IMDb and TMDB IDs into full URLs
* Removing duplicate tags from the tag lists

In [104]:
dataset['rating'] = dataset['rating'].apply(lambda x: round(x, 1))
dataset['imdbId'] = dataset['imdbId'].apply(lambda x: f'https://www.imdb.com/title/tt{x}/')
dataset['tmdbId'] = dataset['tmdbId'].apply(lambda x: f'https://www.themoviedb.org/movie/{x}/')
dataset['tag'] = dataset['tag'].apply(lambda x: list(set(x)))
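
Two edge cases may be worth guarding against, depending on the data. If a tmdbId is missing, the f-string would produce a URL ending in `nan/`, and `list(set(x))` returns tags in no guaranteed order. The sketch below shows one way to handle both; `tmdb_url` and `unique_tags` are hypothetical helper names, not part of the notebook, and whether the source files actually contain missing values is an assumption.

```python
import pandas as pd

def tmdb_url(tmdb_id):
    """Return a TMDB URL, or None when the ID is missing (hypothetical helper)."""
    if pd.isna(tmdb_id):
        return None
    return f'https://www.themoviedb.org/movie/{tmdb_id}/'

def unique_tags(tags):
    """Drop non-string and repeated tags; sort so the output is reproducible."""
    return sorted({t for t in tags if isinstance(t, str)})

print(tmdb_url('862'))
print(tmdb_url(float('nan')))
print(unique_tags(['pixar', 'fun', 'pixar', float('nan')]))
```

These helpers could be passed to `apply` in place of the lambdas above, for example `dataset['tag'].apply(unique_tags)`.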

* Combining the imdbId and tmdbId columns into a single new links column, where each value is a two-element list
* Dropping the original imdbId and tmdbId columns

In [105]:
cols = ['imdbId', 'tmdbId']
dataset['links'] = dataset[cols].values.tolist()
dataset.drop(cols, axis=1, inplace=True)
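
As a quick illustration of what this step produces, the sketch below runs the same operations on a hypothetical one-row dataframe (`demo`). Each row's links value becomes a list holding the IMDb URL followed by the TMDB URL, in the order given by `cols`.

```python
import pandas as pd

# Hypothetical single-movie example (illustration only)
demo = pd.DataFrame({
    'movieId': [1],
    'imdbId': ['https://www.imdb.com/title/tt0114709/'],
    'tmdbId': ['https://www.themoviedb.org/movie/862/'],
})

cols = ['imdbId', 'tmdbId']
# Row-wise conversion: one [imdb_url, tmdb_url] list per movie
demo['links'] = demo[cols].values.tolist()
demo = demo.drop(columns=cols)
print(demo.loc[0, 'links'])
```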

* Exporting the data to a JSON file, with one JSON object per movie

In [106]:
dataset.to_json('movies.json', orient='records')
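
With `orient='records'`, the file holds a single JSON array. Some import tools instead expect one document per line; `mongoimport`, for instance, needs its `--jsonArray` option to read an array file. If the line-delimited form is preferred, pandas can write it with `lines=True`. The sketch below uses a hypothetical toy dataframe (`demo`) and compares both outputs as strings, without writing any files.

```python
import json
import pandas as pd

# Hypothetical toy data (illustration only)
demo = pd.DataFrame({
    'movieId': [1, 2],
    'title': ['A', 'B'],
    'tag': [['fun', 'pixar'], ['dark']],
})

# A single JSON array: [{...}, {...}]
as_array = demo.to_json(orient='records')

# One JSON document per line (newline-delimited JSON)
as_lines = demo.to_json(orient='records', lines=True)

# Both forms decode to the same list of documents
array_docs = json.loads(as_array)
line_docs = [json.loads(line) for line in as_lines.splitlines() if line]
print(array_docs == line_docs)
```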