
# Sentiment analysis

In [1]:
# Add the 'utilities' folder to the path from which we can import modules
import sys
sys.path.append('../../utilities')

In [15]:
import re

import pandas as pd

from nlp import TextPreparation

pd.set_option('display.max_colwidth', 500)

## Import data

In [24]:
# Load the training dataset
text_field = "Text"

dataset = pd.read_csv("./sample_input/sentiment_train.csv", encoding="utf8").reset_index()
# Keep the row index, the sentiment label, and the column holding the text to classify
dataset = dataset[["index", "Good Bad or Neutral", "URL of the article"]]
dataset.columns = ["Id", "Sentiment", "Text"]

dataset.head()

|   | Id | Sentiment | Text |
|---|----|-----------|------|
| 0 | 0 | Bad | Ford Motor: That's a Lot of Recalls -- Barron's Blog |
| 1 | 1 | Neutral | Press Release: Event Alert: Kinaxis Customer, Ford Motor Company, to Present at North American Supply Chain Executive Summit |
| 2 | 2 | Bad | Ford Motor: How Risky is Its Autonomous Driving Plan? -- Barron's Blog |
| 3 | 3 | Good | Ford Motor Plans Ride-Hailing Service With Fleet of Driverless Cars by 2021 |
| 4 | 4 | Bad | Ford Motor Files 8K - Other Events >F |


## General preprocessing

In this section we apply commonly used preprocessing steps to obtain cleaner data. If you have questions about the functions in this section, refer to the notebook [`01- Basic NLP preprocessing.ipynb`](http://localhost:8888/lab/tree/template/notebooks/01-%20Basic%20NLP%20preprocessing.ipynb), available in the `notebooks` folder.

In [25]:
dataset[text_field] = TextPreparation.lowercase(dataset[text_field])
dataset[text_field] = TextPreparation.expand_contractions(dataset[text_field])
dataset[text_field] = TextPreparation.remove_special_chars(dataset[text_field])
dataset[text_field] = TextPreparation.remove_numbers(dataset[text_field])
dataset[text_field] = TextPreparation.remove_stopwords(dataset[text_field])

dataset.head()

|   | Id | Sentiment | Text |
|---|----|-----------|------|
| 0 | 0 | Bad | ford motor lot recalls barrons blog |
| 1 | 1 | Neutral | press release event alert kinaxis customer ford motor company present north american supply chain executive summit |
| 2 | 2 | Bad | ford motor risky autonomous driving plan barrons blog |
| 3 | 3 | Good | ford motor plans ridehailing service fleet driverless cars |
| 4 | 4 | Bad | ford motor files k events f |
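
The `TextPreparation` functions live in the `nlp` module and are not reproduced in this notebook. As a rough illustration only, the sketch below approximates the five steps with plain pandas string methods. The names `basic_clean`, `CONTRACTIONS`, and `STOPWORDS` are hypothetical, and the word lists are tiny illustrative subsets; the real helpers may behave differently.

```python
import pandas as pd

# Hypothetical stand-ins for the TextPreparation helpers (assumed behaviour).
# Both collections are small illustrative subsets, not complete lists.
CONTRACTIONS = {"that's": "that is", "it's": "it is", "don't": "do not"}
STOPWORDS = {"a", "an", "the", "of", "is", "that", "to", "its",
             "how", "with", "by", "at", "other"}


def basic_clean(texts: pd.Series) -> pd.Series:
    """Lowercase, expand contractions, strip punctuation and digits, drop stopwords."""
    texts = texts.str.lower()
    for short, full in CONTRACTIONS.items():
        texts = texts.str.replace(short, full, regex=False)
    # Remove everything that is not a letter, digit, or whitespace
    texts = texts.str.replace(r"[^a-z0-9\s]", "", regex=True)
    # Remove digits
    texts = texts.str.replace(r"\d+", "", regex=True)
    # Drop stopwords and collapse repeated whitespace
    return texts.apply(lambda t: " ".join(w for w in t.split() if w not in STOPWORDS))


sample = pd.Series([
    "Ford Motor: That's a Lot of Recalls -- Barron's Blog",
    "Ford Motor Files 8K - Other Events >F",
])
cleaned = basic_clean(sample)
print(cleaned.tolist())
```

With these illustrative word lists, the two sample headlines come out the same as rows 0 and 4 of the table above.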


## Dataset-specific preprocessing

Here we add some extra preprocessing steps that are specific to the dataset we are dealing with.

In [30]:
# Replace company identifiers with the generic token 'company'
# (a stemmer applied later would reduce it to 'compani', e.g. stemmer.stem("company"))

companies = [
    "cemex", "facebook", "facebooks", "lukoil", "dowdupont", "tesla", "uber", 
    "disney", "reliance", "saic", "gerdau", "deutsche", "kinder", 
    "morgan", "motor", "bank", "ford", "exxon", "masco", "ubs",
    "fiat", "daimler", "alphabet", "basf", "suncor", "apple", "wells",
    "fargo", "citigroup", "citi", "comcast", "viacom"
]


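# Build one case-insensitive, word-bounded pattern that covers every identifier
company_pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, companies)) + r")\b", re.IGNORECASE
)
# Assign the result back: inplace=True on a column selection may act on a copy
dataset[text_field] = dataset[text_field].str.replace(company_pattern, "company", regex=True)
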
dataset.head()

|   | Id | Sentiment | Text |
|---|----|-----------|------|
| 0 | 0 | Bad | company company lot recalls barrons blog |
| 1 | 1 | Neutral | press release event alert kinaxis customer company company company present north american supply chain executive summit |
| 2 | 2 | Bad | company company risky autonomous driving plan barrons blog |
| 3 | 3 | Good | company company plans ridehailing service fleet driverless cars |
| 4 | 4 | Bad | company company files k events f |
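
The `\b` word boundaries in the pattern matter because several identifiers, such as `bank` or `citi`, are also prefixes of ordinary words. The short sketch below uses only the standard `re` module and a made-up sentence (not taken from the dataset) to show that whole-word matches are replaced while longer words are left alone.

```python
import re

# Small illustrative subset of the identifier list
companies = ["ford", "bank", "citi"]

# One case-insensitive pattern; \b keeps matches to whole words only
pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, companies)) + r")\b", re.IGNORECASE
)

# Made-up sentence for illustration
masked = pattern.sub("company", "ford expands banking unit in citi cities")
print(masked)  # company expands banking unit in company cities
```

Without the boundaries, `banking` and `cities` would be partially rewritten, which would distort the vocabulary seen by the sentiment model.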


In [31]:
dataset.to_csv("./sample_output/sentiment_train_processed1.csv", index=False)

## Conclusion
