# B1. Data set selection and description
In this project we aim to distinguish real movie reviews from bot/AI-generated ones. The real reviews come from Cornell University's [The Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data/) site. The site offers review data in several formats; since we want to work with subjective, qualitative data, we chose the archive of full-text reviews. That archive contains almost 1400 reviews as text files, sorted into positive and negative. Because generating fake reviews is our bottleneck, we used only 250 positive and 250 negative reviews, for a total of 500 real reviews.

We then searched for an online dataset of fake reviews, but could not find any dataset or website offering fake/AI-generated movie review text. We therefore generated AI-written reviews ourselves with [InferKit](https://app.inferkit.com/generate). To keep the data balanced, we submitted a batch request for reviews of 500 movies, which returned a CSV file containing each movie name and the corresponding generated review. This gives us a balanced dataset of 500 + 500 movie reviews for the next steps.
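
The subsampling of 250 reviews per class could be done along the following lines. This is only a sketch: the notebook does not say how the 250 files were picked, so the helper name `sample_review_files`, the seed, and the idea of random sampling are all assumptions.

```python
import random
from pathlib import Path


def sample_review_files(class_dir, n, seed=0):
    """Return n randomly chosen .txt paths from class_dir (hypothetical helper)."""
    # Sort first so that the same seed always yields the same selection.
    files = sorted(Path(class_dir).glob("*.txt"))
    if len(files) < n:
        raise ValueError(f"{class_dir} holds only {len(files)} files, need {n}")
    return random.Random(seed).sample(files, n)
```

Calling it once on the positive folder and once on the negative folder of the extracted archive with `n=250` would give the 500 real reviews; the folder paths depend on how the archive was unpacked.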

In [1]:
import glob

import pandas as pd
import os


# B3. Data preparation including transforms, scaling, re-shaping and any feature selection to reduce dimensionality. Summary of cleaned/pre-processed data.
Our data is stored in the `data` folder: the **_fake_** reviews are in `generated_reviews_500.csv`, and the **_real_** reviews are text files in the `real_reviews` folder.

In [5]:
def read_files(text_files_dir):
    """Read every .txt file in text_files_dir and return the contents as a list of strings."""
    files = glob.glob(os.path.join(text_files_dir, '*.txt'))
    real_reviews = []
    for file in files:
        with open(file, 'r') as f:
            real_reviews.append(f.read())

    return real_reviews


reviews_list = read_files('data/real_reviews')
# convert it to a pandas dataframe
reviews_real = pd.DataFrame(reviews_list, columns=['text'])
print(reviews_real.head())

reviews_generated = pd.read_csv('data/generated_reviews_500.csv')
reviews_generated.head()

                                                text
0  an attempt at florida film noir , palmetto fai...
1  birthdays often cause individuals to access th...
2  summer catch ( 2001 ) . starring freddie prinz...
3  as far as " mystery men " is concerned , the b...
4  the team who brought us 'a fish called wanda' ...


Unnamed: 0,prompt_index,prompt_text,completion_index,completion,full_text,reached_end
0,0,A review of The Social Network,0,", the movie about the founding of Facebook, ha...","A review of The Social Network, the movie abou...",False
1,1,A review of The Last Witch Hunter,0,", a new action-horror film from director Neill...","A review of The Last Witch Hunter, a new actio...",False
2,2,A review of Victor Frankenstein,0,\n\nThe Frankenstein Chronicles\n\nA review of...,A review of Victor Frankenstein\n\nThe Franken...,False
3,3,A review of A Street Cat Named Bob,0,", by James Bowen\n\nA Street Cat Named Bob, by...","A review of A Street Cat Named Bob, by James B...",False
4,4,A review of Green Room,0,\n\nGreen Room\n\nDirected by Jeremy Saulnier\...,A review of Green Room\n\nGreen Room\n\nDirect...,False


### Transforms
We drop every column except **_full_text_** and rename it to **_text_**, so that it matches the dataframe containing the real reviews.

In [6]:
# drop all columns except the full_text column
reviews_generated = reviews_generated.drop(['prompt_index',
                                            'prompt_text',
                                            'completion_index',
                                            'completion',
                                            'reached_end'],
                                           axis=1)
# rename the full_text column to text
reviews_generated = reviews_generated.rename(columns={'full_text': 'text'})
reviews_generated.columns

Index(['text'], dtype='object')
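
As an aside, the same result can be reached by selecting the one column to keep instead of listing every column to drop. The sketch below uses a small made-up frame with the same column names as the InferKit export; it is an alternative formulation, not the code used in this notebook.

```python
import pandas as pd

# Toy frame with the same column names as the InferKit export (values are made up).
generated = pd.DataFrame({
    "prompt_index": [0, 1],
    "prompt_text": ["A review of Film A", "A review of Film B"],
    "completion_index": [0, 0],
    "completion": [", a drama.", ", a comedy."],
    "full_text": ["A review of Film A, a drama.", "A review of Film B, a comedy."],
    "reached_end": [False, False],
})

# Select the single column to keep, then rename it.
generated = generated[["full_text"]].rename(columns={"full_text": "text"})
print(list(generated.columns))  # ['text']
```

Selecting by name keeps working even if the export gains extra columns later, whereas an explicit drop list would have to be updated.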

Next, we add a **label** column to both dataframes: **1** marks a real review and **0** marks a fake (generated) review.

In [7]:
reviews_real['label'] = 1
print(reviews_real.head())
print(len(reviews_real))

reviews_generated['label'] = 0
print(reviews_generated.head())
print(len(reviews_generated))


                                                text  label
0  an attempt at florida film noir , palmetto fai...      1
1  birthdays often cause individuals to access th...      1
2  summer catch ( 2001 ) . starring freddie prinz...      1
3  as far as " mystery men " is concerned , the b...      1
4  the team who brought us 'a fish called wanda' ...      1
500
                                                text  label
0  A review of The Social Network, the movie abou...      0
1  A review of The Last Witch Hunter, a new actio...      0
2  A review of Victor Frankenstein\n\nThe Franken...      0
3  A review of A Street Cat Named Bob, by James B...      0
4  A review of Green Room\n\nGreen Room\n\nDirect...      0
500


### Scaling, Re-shaping & Reducing Dimensionality

Now we merge the two dataframes and shuffle the rows so that real and generated reviews are well mixed before training.

In [8]:
# merge the two dataframes
reviews = pd.concat([reviews_real, reviews_generated], ignore_index=True)
# random shuffle the dataframe
reviews = reviews.sample(frac=1).reset_index(drop=True)
print(reviews.head())

                                                text  label
0  A review of Contagion, the new movie starring ...      0
1  if there has been a better comedy of errors th...      1
2  A review of The Other Woman by Claire Messud\n...      0
3  silent fall has the shreds and shards of a far...      1
4  starring : matt damon , ben affleck , linda fi...      1


# B2. Dataset analysis, visualisation, feature correlation, insights extracted from data visualisation

# B4. Selection of 3-4 algorithms most appropriate. Justify your selection.