# Movie Reviews Sentiment Analysis

### Problem Statement
- The goal of this project is to predict the sentiment (positive or negative) of a movie review.

### Dataset
- Source - https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
- The dataset contains 50,000 reviews: 25,000 positive and 25,000 negative.

This notebook covers data extraction: downloading the archive, reading the review files, and saving them as a single CSV file.

### Importing required packages

In [5]:
import numpy as np
import pandas as pd
import urllib.request
import tarfile 
import os

### Retrieving the dataset and extracting its contents

In [3]:
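url = 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
filename = 'aclImdb_v1.tar.gz'
# Download the archive only if it is not already present
if not os.path.exists(filename):
    urllib.request.urlretrieve(url, filename)
# Extract separately, so an existing archive is still unpacked if 'aclImdb' is missing
if not os.path.exists('aclImdb'):
    with tarfile.open(filename, 'r:gz') as tar:
        tar.extractall()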
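
After extraction, it can be useful to confirm that the folders hold the expected number of review files. The following is a minimal sketch, assuming the `aclImdb/<split>/<label>` layout used in the next cell; `count_reviews` is a hypothetical helper, not part of the dataset or of any library.

```python
import os

def count_reviews(root='aclImdb', splits=('train', 'test'), labels=('neg', 'pos')):
    """Return {(split, label): number of .txt files} for an extracted dataset."""
    counts = {}
    for split in splits:
        for label in labels:
            folder = os.path.join(root, split, label)
            if os.path.isdir(folder):
                counts[(split, label)] = sum(
                    1 for name in os.listdir(folder) if name.endswith('.txt'))
            else:
                # A missing folder is reported as 0 instead of raising an error
                counts[(split, label)] = 0
    return counts

# Only run the check when the dataset has actually been extracted
if os.path.isdir('aclImdb'):
    counts = count_reviews()
    print(counts, 'total =', sum(counts.values()))
```

If the total does not match the 50,000 reviews described in the Dataset section, the download or extraction was probably interrupted.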

### Creating a DataFrame from the dataset and saving it to a CSV file

In [None]:
folder_name = 'aclImdb'
if not os.path.exists('data/imdb_data.csv'):
    labels = {'pos': 1, 'neg': 0}
    rows = []
    for split in ('test', 'train'):
        for label in ('neg', 'pos'):
            path = os.path.join(folder_name, split, label)
            # path looks like 'aclImdb/test/neg'
            for file in os.listdir(path):
                # file looks like '10000_4.txt', i.e. '<id>_<rating>.txt'
                # Parse the whole rating; file[-5] would read a rating of 10 as '0'
                score = int(os.path.splitext(file)[0].split('_')[1])
                sentiment = labels[label]
                file_path = os.path.join(path, file)
                # file_path looks like 'aclImdb/test/neg/10000_4.txt'
                # Always state the file encoding explicitly; these files are read as UTF-8
                with open(file_path, 'r', encoding='utf-8') as infile:
                    txt = infile.read()
                rows.append([txt, score, sentiment])
    # Build the DataFrame once; calling pd.concat inside the loop is much slower
    df = pd.DataFrame(rows, columns=['Review', 'Rating', 'Sentiment'])
    # Shuffle the rows reproducibly, then renumber the index
    np.random.seed(42)
    df = df.reindex(np.random.permutation(df.index))
    df.reset_index(drop=True, inplace=True)
    os.makedirs('data', exist_ok=True)
    df.to_csv('data/imdb_data.csv', index=False, encoding='utf-8')
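
The review file names follow the pattern `<id>_<rating>.txt`, as in `10000_4.txt`. The sketch below shows one way to recover both numbers from a name; `parse_review_filename` is a hypothetical helper and `123_10.txt` is an illustrative name, not a specific file from the dataset. Splitting on the underscore matters because a single-character slice such as `name[-5]` would read a two-digit rating of 10 as `0`.

```python
import os

def parse_review_filename(name):
    """Split '<id>_<rating>.txt' into (review_id, rating) as integers."""
    stem, _ext = os.path.splitext(name)   # '10000_4.txt' -> '10000_4'
    review_id, rating = stem.split('_')   # '10000_4' -> '10000', '4'
    return int(review_id), int(rating)

print(parse_review_filename('10000_4.txt'))  # (10000, 4)
print(parse_review_filename('123_10.txt'))   # (123, 10)
```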

### Displaying DataFrame

In [9]:
df = pd.read_csv('data/imdb_data.csv', encoding='utf-8')
df.head()

| | Review | Rating | Sentiment |
|---|---|---|---|
| 0 | Imagine The Big Chill with a cast of twenty-so... | 2 | 0 |
| 1 | I'd have to say that I've seen worse Sci Fi Ch... | 3 | 0 |
| 2 | Director Fabio Barreto got a strange Academy N... | 1 | 0 |
| 3 | Pretty bad PRC cheapie which I rarely bother t... | 4 | 0 |
| 4 | This is a very intriguing short movie by David... | 8 | 1 |
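
Beyond looking at the first rows, a quick summary can confirm that the saved file matches the dataset description. This is a minimal sketch, assuming the `Review`, `Rating`, and `Sentiment` columns created above; `summarize_labels` is a hypothetical helper, and the small sample frame only mirrors the five rows displayed above.

```python
import pandas as pd

def summarize_labels(df):
    """Return the row count, per-class counts, and rating range of a review DataFrame."""
    counts = df['Sentiment'].value_counts()
    return {
        'n_reviews': len(df),
        'sentiment_counts': {int(k): int(v) for k, v in counts.items()},
        'rating_min': int(df['Rating'].min()),
        'rating_max': int(df['Rating'].max()),
    }

# Illustrative sample using the ratings and labels of the five rows shown above
sample = pd.DataFrame({
    'Review': ['a', 'b', 'c', 'd', 'e'],
    'Rating': [2, 3, 1, 4, 8],
    'Sentiment': [0, 0, 0, 0, 1],
})
print(summarize_labels(sample))
```

Run on the full DataFrame, the summary should report 50,000 reviews split evenly between the two sentiment classes, as stated in the Dataset section.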
