### Extract

In this `.ipynb` we extract the data downloaded from https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences.

The download contains a `.zip` file. I extracted the raw `.txt` files from this `.zip` and saved them in `./data`.

The dir `./data` should contain the following four files:

1. `amazon_cells_labelled.txt`
2. `imdb_labelled.txt`
3. `yelp_labelled.txt`
4. `readme.txt`


This notebook lets us extract the data from any `...labelled.txt` file in the `./data` directory and convert it into a pandas `DataFrame`.

In [1]:
import os
import pandas as pd

In [2]:
"""
Read in the names of the files in our ./data dir.

Keep a file only if it is a .txt file. Ignore readme.txt.
Sort the names so that the order (and therefore files[0]) is deterministic.
"""

DATA_PATH = './data'

files = sorted(
    file for file in os.listdir(DATA_PATH)
    if os.path.splitext(file)[1] == '.txt' and file != 'readme.txt'
)

In [3]:
print(files)

['amazon_cells_labelled.txt', 'imdb_labelled.txt', 'yelp_labelled.txt']
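
The same filter can also be sketched with `pathlib`. This is only an illustration: the helper name `list_labelled_files` is hypothetical and not part of the notebook, and the demo runs against a throwaway directory of empty placeholder files rather than the real `./data`.

```python
from pathlib import Path
import tempfile


def list_labelled_files(data_path):
    """Return the sorted names of the .txt files in data_path, skipping readme.txt."""
    return sorted(
        p.name
        for p in Path(data_path).glob('*.txt')
        if p.name != 'readme.txt'
    )


# Demonstrate on a temporary directory that holds empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ['yelp_labelled.txt', 'readme.txt', 'imdb_labelled.txt', 'notes.md']:
        (Path(tmp) / name).touch()
    example_files = list_labelled_files(tmp)

print(example_files)
```

Sorting the names keeps the result independent of the order in which the operating system lists the directory.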


In [4]:
def file_to_object(path):
    """
    Returns a list of data samples [document, sentiment] from our raw data.
    
    Document is referred to as X (this is our predictor variable).
    Sentiment is referred to as y (this is our target variable).
    
    Parameters
    ----------
    path (str) : a path to our raw data to open.
    
    Returns
    -------
    out (list) : a list of lists. Internal list is of structure [document (str), sentiment (int)] 
    """
    
    out = list()
    
    with open(path, 'r') as file:
        for line in file:
            split_line = line.split()      # split the raw line of text from the .txt file into a list of tokens.
            if not split_line:             # skip blank lines, which have no label to parse.
                continue
            X = ' '.join(split_line[:-1])  # take the document (sentence) and assign it to X as a str.
            y = int(split_line[-1])        # take the sentiment (last token) and assign it to y as an int.
            out.append([X, y])             # append [document, sentiment] to the outer list.
            
    return out
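
To make the parsing step concrete, here is a minimal sketch that handles one line at a time. It assumes the sentence and its label are separated by a tab, which the notebook itself does not state, and `parse_line` is a hypothetical helper rather than part of the notebook. The sample lines mirror two rows from the Amazon data shown further down.

```python
def parse_line(line):
    """Split one raw line into [document (str), sentiment (int)]."""
    # Assumption: the label is the last tab-separated field on the line.
    document, label = line.rstrip('\n').rsplit('\t', 1)
    return [document.strip(), int(label)]


# Hand-written sample lines in the assumed "sentence<TAB>label" layout.
sample_lines = [
    'Good case, Excellent value.\t1\n',
    'The mic is great.\t1\n',
]
parsed = [parse_line(line) for line in sample_lines]
print(parsed)
```

Unlike `line.split()` followed by `' '.join(...)`, splitting once from the right leaves any internal whitespace in the sentence untouched.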

In [5]:
def create_dataframe(file):
    """
    Returns a single pandas DataFrame that contains the reviews from one of the raw data sources.
    
    Parameters
    ----------
    file (str) : the name of a file inside DATA_PATH to open.
    
    Returns
    -------
    pd.DataFrame : a dataframe that has two columns [document, sentiment].
    """
    
    path = os.path.join(DATA_PATH, file)  # get the path to a file.
    data = file_to_object(path)           # extract the data from that file.
    dataframe = pd.DataFrame(data, columns=['document', 'sentiment'])  # use pandas to make the data a DataFrame.
        
    return dataframe

In [6]:
"""
Create a dataframe and assign it to dataframe.

I use the amazon_cells_labelled.txt file in this example.

Use dataframe.head() to print an example of the data to our Jupyter notebook.
"""

dataframe = create_dataframe(files[0])

dataframe.head()

| | document | sentiment |
|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |


In [7]:
"""
Save the final dataframe as dataframe.pkl in ./data.

We can now open this file in other .ipynb notebooks.
"""

dataframe.to_pickle(os.path.join(DATA_PATH, 'dataframe.pkl'))
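
Loading the pickle from another notebook could look like the sketch below. So that it runs on its own, it round-trips a tiny stand-in frame through a temporary directory. The names `sample`, `pickle_path` and `restored` are illustrative only.

```python
import os
import tempfile

import pandas as pd

# A tiny stand-in for the real dataframe, so the sketch runs on its own.
sample = pd.DataFrame(
    [['Good case, Excellent value.', 1], ['The mic is great.', 1]],
    columns=['document', 'sentiment'],
)

with tempfile.TemporaryDirectory() as tmp:
    pickle_path = os.path.join(tmp, 'dataframe.pkl')
    sample.to_pickle(pickle_path)           # same call as in the cell above
    restored = pd.read_pickle(pickle_path)  # what another notebook would run

# Check that the round trip preserved the data.
print(restored.equals(sample))
```

In a real notebook the load step would point at the saved file instead, for example `pd.read_pickle(os.path.join('./data', 'dataframe.pkl'))`.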