# Harper Adams Data Science 
## NLP Basics in Python
<center>
   <img src="img/HAP-E-logo.png" alt="HAP-E Group" width="125"/>
</center>

Ed Harris <br/>
HARUG! / HAP-E Group <br/>
2021-11-17 <br/>

# 0. What do we want to accomplish today?

1. Remarks on Python environment setup
2. Introduce Natural Language Processing (NLP) and sentiment analysis
3. Explore the dataset and example
4. Make a basic 'word cloud'
5. Word cloud graphical data summary
6. Data handling for sentiment analysis
7. First sentiment model
8. Evaluating the sentiment model
9. Live coding (you try it)

&nbsp;

# 1. Remarks on Python environment setup

The options below are ordered from easiest (and best for the beginner) to hardest. NB "beginner" here means anyone who is not confident setting the environment path for a programming language on their own operating system, anyone who has not used Linux before, or anyone who is not confident managing files from the command line.

- Anaconda
- Colab
- JupyterLab Standalone app
- any other method
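
Whichever option is chosen, it can help to confirm that the packages imported later in this notebook are actually installed. The following is a minimal sketch using only the standard library; the helper name `check_packages` is hypothetical, and the package list is simply taken from the import cells in this notebook.

```python
from importlib.util import find_spec

# packages imported later in this notebook
REQUIRED = ["pandas", "matplotlib", "seaborn", "plotly", "nltk", "wordcloud"]

def check_packages(names):
    """Return a dict mapping each package name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in names}

status = check_packages(REQUIRED)
for name, ok in status.items():
    print(f"{name:12s} {'installed' if ok else 'MISSING'}")
```

Anything reported as missing can usually be installed with `conda install` or `pip install` from the environment in use.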

&nbsp;

In [None]:
# NB you need pandas installed for this
# import pandas for data tools
import pandas as pd

# NB you need matplotlib and seaborn installed for this
# imports tools for graphs
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
%matplotlib inline

# NB you need plotly installed for this
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.express as px



In [None]:
# for wordclouds and NLP
import nltk
from nltk.corpus import stopwords

# NB you need wordcloud installed for this
from wordcloud import WordCloud, STOPWORDS

# 2. Introduce Natural Language Processing (NLP) and sentiment analysis

Use cases for sentiment analysis:
- Company/feedback from consumers
- Survey analysis
- Political tendency
- Many others

Use cases for NLP:
- Chat bots (e.g. customer service, help lines, etc.)
- Auto-generation and analysis of text
- Language translation

Caveats:
- This is a broad and deep topic
- It is considered hard, and becoming an expert takes considerable effort
- Whole textbooks exist
- It is an active area of research with frequent new developments


&nbsp;

# 3. Explore the dataset and example

## The dataset
The dataset we will use is a fairly chunky collection of food reviews from *Amazon.com*: [**Amazon Fine Food from Kaggle**](https://www.kaggle.com/snap/amazon-fine-food-reviews?select=Reviews.csv)

Original data source: J. McAuley and J. Leskovec. *From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews*. WWW, 2013.


&nbsp;

In [None]:
# read the data
df = pd.read_csv('data/Reviews.csv')


## The variables

- **Id** - row reference
- **ProductId** - Amazon product ID
- **UserId** - Amazon user ID
- **ProfileName** - Amazon user account profile name
- **HelpfulnessNumerator** - number of users who found the review helpful
- **HelpfulnessDenominator** - number of users who indicated whether or not they found the review helpful
- **Score** - star rating 1 to 5 (5 is best/perfect)
- **Time** - timestamp of the review (Unix time, i.e. seconds since 1970-01-01 UTC)
- **Summary** - Text summary of review
- **Text** - Text review
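
To illustrate how a couple of these variables might be used, the sketch below converts `Time` to a date and computes a helpfulness ratio. It runs on a small made-up data frame rather than the real file; the new column names `Date` and `HelpfulRatio` are hypothetical, and treating `Time` as Unix seconds is an assumption.

```python
import pandas as pd

# made-up rows using the same column names as Reviews.csv
toy = pd.DataFrame({
    "HelpfulnessNumerator": [1, 0, 3],
    "HelpfulnessDenominator": [1, 0, 4],
    "Time": [1303862400, 1346976000, 1219017600],
})

# assumption: Time is stored as seconds since 1970-01-01 (Unix time)
toy["Date"] = pd.to_datetime(toy["Time"], unit="s")

# fraction of voters who found the review helpful; NaN when nobody voted
denominator = toy["HelpfulnessDenominator"].where(toy["HelpfulnessDenominator"] > 0)
toy["HelpfulRatio"] = toy["HelpfulnessNumerator"] / denominator

print(toy[["Date", "HelpfulRatio"]])
```

Masking zero denominators first means reviews with no votes come out as missing rather than as a division error or an infinite value.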


Example product (row 3 **B000LQOCH0**)

https://www.amazon.com/Turkish-Delight-Filbert-Hazelnuts-Sultan/dp/B000LQOCH0/ref=cm_cr_arp_d_product_top?ie=UTF8

In [None]:
# how many rows and columns are in the data?
print(df.shape)


In [None]:
# print the first few rows of data
df.head()

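In [None]:
# What is the distribution of the ratings like?
# one bar per star rating (bin edges sit halfway between the integer scores)
plt.hist(df['Score'], bins=[0.5, 1.5, 2.5, 3.5, 4.5, 5.5])
plt.xlabel('Score')
plt.show()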

# 4. Make a basic 'word cloud'

&nbsp;

In [None]:
# Create stopword list:
import nltk
nltk.download('stopwords')
# use a new name so the imported nltk `stopwords` corpus is not overwritten
stop_words = set(stopwords.words('english'))
stop_words.update(["br", "href"])
# join all reviews into one long string
textt = " ".join(str(review) for review in df['Text'])

wordcloud = WordCloud(stopwords=stop_words).generate(textt)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.savefig('wordcloud11.png')
plt.show()
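
A word cloud is essentially a picture of word counts: the more often a word occurs, the larger it is drawn. The sketch below makes those counts explicit on two made-up review snippets, using only the standard library. The tiny stopword set and the helper `top_words` are hypothetical stand-ins for the nltk list and the `WordCloud` class used in this notebook.

```python
import re
from collections import Counter

# hypothetical mini stopword list (the notebook uses nltk's English list)
STOP = {"the", "is", "and", "a", "this", "it", "i", "br", "href"}

def top_words(texts, stop_words, n=3):
    """Count lower-cased words across texts, skipping stopwords."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts.most_common(n)

reviews = [
    "This coffee is great and the price is great",
    "Great taste, I love this coffee",
]
print(top_words(reviews, STOP))
```

Looking at the raw counts like this is also a quick way to spot leftover markup (such as `br` or `href`) that should be added to the stopword list.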

In [None]:
# assign reviews with score > 3 as positive sentiment
# score < 3 negative sentiment
# remove score = 3
# .copy() avoids pandas' SettingWithCopyWarning when the new column is added
df = df[df['Score'] != 3].copy()
df['sentiment'] = df['Score'].apply(lambda rating: +1 if rating > 3 else -1)

In [None]:
# look at new data
df.head()

# 5. Word cloud graphical data summary

&nbsp;



In [None]:
# make 2 new data frames, one for positive one for negative

# split df - positive and negative sentiment:
positive = df[df['sentiment'] == 1]
negative = df[df['sentiment'] == -1]

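In [None]:
# Generate a styled word cloud
# import the nltk stopword corpus under its own name, in case `stopwords` was reassigned
from nltk.corpus import stopwords as nltk_stopwords
stop_words = set(nltk_stopwords.words('english')) | {"br", "href"}
# Assumption: `text` should hold the positive reviews; use `negative` to compare
text = " ".join(str(review) for review in positive['Text'])
wordcloud = WordCloud(width=3000, height=2000,
                      random_state=1, background_color='salmon',
                      colormap='Pastel1', collocations=False,
                      stopwords=stop_words).generate(text)
# Plot (plot_cloud was never defined, so draw with matplotlib directly)
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()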

# 6. Data handling for sentiment analysis

&nbsp;
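
Before modelling, review text typically needs some light cleaning, and the data needs splitting into training and test sets. The following is a minimal sketch under those assumptions, run on a made-up data frame with the same `Text` and `sentiment` columns as above. The helper `remove_punctuation` and the 80/20 split are illustrative choices, not a prescribed recipe.

```python
import string
import pandas as pd

# made-up reviews using the notebook's column names
toy = pd.DataFrame({
    "Text": ["Great coffee!", "Awful, stale taste...", "Love it!!",
             "Not good.", "Tasty & fresh", "Would not buy again!"],
    "sentiment": [1, -1, 1, -1, 1, -1],
})

def remove_punctuation(text):
    """Strip ASCII punctuation characters from a string."""
    return text.translate(str.maketrans("", "", string.punctuation))

toy["Text"] = toy["Text"].apply(remove_punctuation)

# random 80/20 split; random_state makes it repeatable
train = toy.sample(frac=0.8, random_state=0)
test = toy.drop(train.index)

print(len(train), "training rows,", len(test), "test rows")
```

Keeping the test rows completely separate from the training rows is what makes a later evaluation meaningful.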

# 7. First sentiment model

&nbsp;
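
One common choice for a first sentiment model is a bag-of-words representation fed into logistic regression. The sketch below uses scikit-learn, which is an assumption: it is not imported elsewhere in this notebook and needs installing separately. The training sentences are made up, and a real run would use the `Text` and `sentiment` columns instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# made-up training data: +1 positive, -1 negative
train_text = ["great coffee love it", "tasty and fresh", "love this tea",
              "awful stale taste", "terrible would not buy", "stale and awful"]
train_label = [1, 1, 1, -1, -1, -1]

# turn each text into a vector of word counts
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_text)

# fit a logistic regression on the word counts
model = LogisticRegression()
model.fit(X_train, train_label)

# new text must go through the same fitted vectorizer
new_text = ["love this coffee", "awful stale taste"]
X_new = vectorizer.transform(new_text)
predictions = model.predict(X_new)
print(list(predictions))
```

Note that the vectorizer is fitted on the training text only and then reused with `transform`, so the model sees new text in the same vocabulary it was trained on.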

In [None]:
# distribution of sentiment
# map the numeric labels to readable names in one step
df['sentimentt'] = df['sentiment'].map({-1: 'negative', 1: 'positive'})

plt.hist(df['sentimentt'], bins=2)
plt.show()

# 8. Evaluating the sentiment model

&nbsp;
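
A sentiment model is usually judged on held-out data by comparing predicted labels with the true labels. The sketch below counts the four outcomes of a confusion matrix and derives accuracy, precision and recall from them. It uses made-up labels and the standard library only; the function name `evaluate` is hypothetical.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return confusion-matrix counts and summary metrics for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
    }

# made-up true and predicted sentiment labels
y_true = [1, 1, 1, -1, -1, 1, -1, 1]
y_pred = [1, 1, -1, -1, 1, 1, -1, 1]
print(evaluate(y_true, y_pred))
```

Accuracy alone can mislead when one sentiment class is much more common than the other, which is why precision and recall are reported alongside it. scikit-learn's `sklearn.metrics` module offers `confusion_matrix` and `classification_report` for the same purpose.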

# 9. Live coding (you try it)
<center>
   <img src="img/cat-laptop.jfif" alt="cat hacker" width="300"/>
</center>

&nbsp;