In [10]:
import pandas as pd
import os

# path to parent folder (works in .py IDEs)
#path = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))

# path to parent folder (works in .ipynb)
# dirname avoids hard-coding the folder-name length, unlike os.getcwd()[:-5]
path = os.path.dirname(os.getcwd())

## Dataframe preparation

Create a dataframe of users, movies and reviews. Only movies that appear in both the 5000 dataset and the reviews dataset are considered.

### Get a list of movies from 5000 dataset

In [11]:
movies_5000 = pd.read_csv(os.path.join(path, "dataset_5000", "tmdb_5000_movies.csv"),
                          usecols = ["original_title"])
movies_5000 = movies_5000.original_title.tolist()

### Import reviews

A dataframe containing the username, movie title, production year and review text is prepared. The procedure is briefly as follows:

<ol>
    <li> Import the data and perform initial cleaning (the data are generally clean and do not require much cleaning). </li>
    <li> Create a list of movies that appear in both the '5000' and 'reviews' datasets. </li>
    <li> Import all reviews and concatenate them into one pandas dataframe. </li>
</ol>
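
Step 2 can be sketched as a set intersection. The snippet below is a minimal illustration, not the notebook's own code: `titles_5000`, `titles_reviews` and their values are hypothetical stand-ins for the two title collections.

```python
# Toy stand-ins for the two title collections (hypothetical values).
titles_5000 = ["Avatar", "Face/Off", "Spectre"]
titles_reviews = {"Face/Off": "a.csv", "Avatar": "b.csv", "Up": "c.csv"}

# Iterating a dict yields its keys, so set(titles_reviews) is the set of titles.
common_titles = sorted(set(titles_5000) & set(titles_reviews))
print(common_titles)
```

Converting to sets keeps each membership test constant-time, which may matter when the title list has several thousand entries.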

In [12]:
# list of file paths to import
file_paths = []
for root, _, files in os.walk(os.path.join(path, "dataset_review", "2_reviews_per_movie_raw")):
    for filename in files:
        file_paths.append(os.path.join(root, filename))

# create dictionary with movie title and corresponding path
movies = dict()
for movie in file_paths:
    # drop the 9-character suffix (separator, 4-digit year, ".csv");
    # "_" stands in for ":" in the file names
    movies[os.path.basename(movie)[:-9].replace("_", ":")] = movie
    
# adjust titles in which "_" stood for "/" rather than ":"
movies["50/50"] = movies.pop("50:50")
movies["Face/Off"] = movies.pop("Face:Off")
movies["Frost/Nixon"] = movies.pop("Frost:Nixon")
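
The fixed-width slices above depend on the file-name layout. As a hedged alternative, the title and year could be split on the last separator instead. The helper `split_title_year` is hypothetical, and the layout `<title> <year>.csv` with a single space before the year is an assumption inferred from the slices, not something the notebook states.

```python
import os

def split_title_year(file_path):
    """Split '<title> <year>.csv' into (title, year). Hypothetical helper."""
    # Drop the directory part and the ".csv" extension.
    stem = os.path.splitext(os.path.basename(file_path))[0]
    # Assumption: the year is the last space-separated token of the stem.
    title, year = stem.rsplit(" ", 1)
    # "_" stands in for ":" in the file names.
    return title.replace("_", ":"), year

print(split_title_year(os.path.join("raw", "10 Cloverfield Lane 2016.csv")))
```

Titles such as "Face:Off" would still need the manual "/" adjustment shown above.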

In [26]:
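# create main dataframe
movies_5000_set = set(movies_5000)  # constant-time membership test
frames = []
for key, file_path in movies.items():
    if key in movies_5000_set:
        temp = pd.read_csv(file_path, usecols = ["username", "review"])
        temp.insert(1, "title", key)
        temp["year"] = os.path.basename(file_path)[-8:-4]  # 4-digit year before ".csv"
        frames.append(temp)

# concatenate once instead of growing the dataframe inside the loop
movies_final = pd.concat(frames)
movies_final.reset_index(inplace = True)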

Unnamed: 0,index,username,title,review,year
0,0,Imme-van-Gorp,10 Cloverfield Lane,This movie is full of suspense. It makes you g...,2016
1,1,sonofocelot-1,10 Cloverfield Lane,I'll leave this review fairly concise. <br/><b...,2016
2,2,mhodaee,10 Cloverfield Lane,I give the 5/10 out of the credit I owe to the...,2016
3,3,fil-nik09,10 Cloverfield Lane,"First of all, I must say that I was expecting ...",2016
4,4,DVR_Brale,10 Cloverfield Lane,I've always loved movies with strong atmospher...,2016
...,...,...,...,...,...
660557,257,JoeB131,Frost/Nixon,"Oddly, it's the signature Nixon line, and Lang...",2008
660558,258,wolftab-1,Frost/Nixon,A fine rendition of the play of the same name....,2008
660559,259,leplatypus,Frost/Nixon,"With 3 free tickets to use before 12 April, i ...",2008
660560,260,antoniotierno,Frost/Nixon,Both Langella and Sheen don't look that like t...,2008


## Exporting data

The movies_final dataframe contains all reviews for the 825 movies that appear in both the 5000 movies dataset and the reviews dataset.
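
A movie count like this can be checked by counting distinct titles. The sketch below uses a small hypothetical frame, `toy`, in place of movies_final.

```python
import pandas as pd

# Hypothetical miniature of movies_final: three reviews of two movies.
toy = pd.DataFrame({
    "username": ["a", "b", "a"],
    "title": ["Frost/Nixon", "Frost/Nixon", "50/50"],
})

# nunique() counts distinct values, so repeated titles are counted once.
n_movies = toy["title"].nunique()
print(n_movies)
```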

### Whole dataframe

The exported relation has the columns ['username', 'title', 'review']; the exported file is very large (>800 MB).

In [13]:
#movies_final.to_csv(os.path.join(path, "full_review_data.csv"))
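
Given the file size, one option is to write the CSV gzip-compressed. This is a sketch of that idea on a hypothetical toy frame, not what the notebook does. The `.csv.gz` file name and the temporary directory are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature of the review data.
toy = pd.DataFrame({
    "username": ["a", "b"],
    "title": ["50/50", "Face/Off"],
    "review": ["Good.", "Great."],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "full_review_data.csv.gz")
    # Write a gzip-compressed CSV without the row index.
    toy.to_csv(out_path, index=False, compression="gzip")
    # read_csv infers the compression from the ".gz" extension.
    restored = pd.read_csv(out_path)

print(restored.shape)
```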

### Network dataframe

The exported relation contains only the attributes ['username', 'title', 'year'].

In [27]:
#movies_final[["username", "title", "year"]].to_csv(os.path.join(path, "Network", "net_data.csv"), index = False)
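
The effect of selecting a column subset and passing index = False can be seen on a small example. The frame `toy` and the in-memory buffer are hypothetical stand-ins for movies_final and the output file.

```python
import io

import pandas as pd

# Hypothetical miniature of movies_final.
toy = pd.DataFrame({
    "username": ["a"],
    "title": ["50/50"],
    "review": ["Good."],
    "year": ["2011"],
})

# Write only the selected columns to an in-memory buffer, without the index.
buffer = io.StringIO()
toy[["username", "title", "year"]].to_csv(buffer, index=False)

# The first line of the CSV is the header row.
header = buffer.getvalue().splitlines()[0]
print(header)
```

Without index = False, the row index would be written as an extra unnamed first column.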