# Natural Language Processing - Apple Sentiment
by Michael Kearns

# Business Understanding

Customer relations and approval are highly valued at the Tech Sales Company (TSC). TSC currently sells Apple products, and we want to determine whether we should continue to market and supply them to our customers. If customers no longer like Apple, we want to distance ourselves from the brand, show our loyalty to our customers, and promote other products. To determine whether customers support Apple, we plan to develop a natural language processing model that identifies customer sentiment toward Apple based on the content of X (Twitter) posts. This model will then be used to monitor customer sentiment over the next year by rating future X posts.

# Data Understanding
To train and test this machine learning model, data from [Crowdflower](https://www.kaggle.com/datasets/slythe/apple-twitter-sentiment-crowdflower) will be used. This dataset includes nearly 4,000 X posts from December 2014 that reference Apple. The primary features are the "sentiment" and "text" columns, which hold each user's sentiment toward Apple and the post itself. Sentiment is rated on a three-level scale coded as 1, 3, or 5.


## Data Preparation

In [1]:
import pandas as pd

#import csv file
filename = 'data/Apple-Twitter-Sentiment-DFE.csv'
df = pd.read_csv(filename, encoding="latin1")

#check dataframe info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3886 entries, 0 to 3885
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   _unit_id              3886 non-null   int64  
 1   _golden               3886 non-null   bool   
 2   _unit_state           3886 non-null   object 
 3   _trusted_judgments    3886 non-null   int64  
 4   _last_judgment_at     3783 non-null   object 
 5   sentiment             3886 non-null   object 
 6   sentiment:confidence  3886 non-null   float64
 7   date                  3886 non-null   object 
 8   id                    3886 non-null   float64
 9   query                 3886 non-null   object 
 10  sentiment_gold        103 non-null    object 
 11  text                  3886 non-null   object 
dtypes: bool(1), float64(2), int64(2), object(7)
memory usage: 337.9+ KB


Only the "sentiment" and "text" columns will be retained for this model. All other columns can be removed.

In [2]:
#Keep relevant columns in dataset.
df = df[['sentiment','text']]
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3886 entries, 0 to 3885
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentiment  3886 non-null   object
 1   text       3886 non-null   object
dtypes: object(2)
memory usage: 60.8+ KB


In [3]:
#Check overall sentiment distribution
print(df['sentiment'].value_counts())

sentiment
3               2162
1               1219
5                423
not_relevant      82
Name: count, dtype: int64


Based on background information provided by the data source, sentiment is coded as 1 (Negative), 3 (Neutral), and 5 (Positive).
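
The coding above can be attached to the data as readable labels. The following is a minimal sketch on a toy dataframe, not the notebook's actual step. The `SENTIMENT_LABELS` lookup and the `sentiment_label` column are hypothetical names. It assumes the codes are stored as strings, since `df.info()` reports the "sentiment" column as `object`.

```python
import pandas as pd

# Toy stand-in for the real dataframe. The codes are strings because the
# "sentiment" column mixes digits with the text value "not_relevant".
toy = pd.DataFrame({"sentiment": ["3", "1", "5", "not_relevant", "3"]})

# Hypothetical lookup table based on the coding described above.
SENTIMENT_LABELS = {"1": "Negative", "3": "Neutral", "5": "Positive"}

# Codes missing from the lookup (such as "not_relevant") become NaN.
toy["sentiment_label"] = toy["sentiment"].map(SENTIMENT_LABELS)

label_counts = toy["sentiment_label"].value_counts()
print(label_counts)
```

Leaving unmapped codes as NaN makes any unexpected value easy to spot before modeling.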

In [4]:
#Check proportion of sentiment values
print(df['sentiment'].value_counts(normalize = True))

sentiment
3               0.556356
1               0.313690
5               0.108852
not_relevant    0.021101
Name: proportion, dtype: float64


"Neutral" posts make up almost 56% of the dataset, followed by 31% "Negative" posts and 11% "Positive" posts. A model trained on this distribution would likely be dominated by the "Neutral" and "Negative" posts and may not rate future posts accurately, particularly "Positive" ones. Therefore, some tactics will need to be implemented to deal with the class imbalance. Before the class imbalance is addressed, the data will need to be cleaned, preprocessed, and split into train and test sets.

Fewer than 100 posts are labeled "not_relevant". These will be removed and not considered in the model.

In [6]:
#Remove "not_relevant" rows
df = df[df['sentiment']!= 'not_relevant']

#Recheck proportion of sentiment values and confirm "not_relevant" rows are removed
print(df['sentiment'].value_counts(normalize = True))

sentiment
3    0.568349
1    0.320452
5    0.111199
Name: proportion, dtype: float64
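
One possible way to handle the remaining imbalance later on is a stratified train/test split combined with class weights. The sketch below uses a toy dataframe that only roughly mirrors the proportions above. It assumes scikit-learn is available. The 80/20 split, the `random_state`, and the choice of "balanced" weights are illustrative assumptions, not decisions taken in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Toy stand-in roughly mirroring the proportions above (57% / 32% / 11%).
toy = pd.DataFrame({
    "text": [f"post {i}" for i in range(100)],
    "sentiment": ["3"] * 57 + ["1"] * 32 + ["5"] * 11,
})

# stratify keeps the class proportions similar in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    toy["text"], toy["sentiment"],
    test_size=0.2, stratify=toy["sentiment"], random_state=42,
)

# "balanced" weights are n_samples / (n_classes * class_count),
# so rarer classes receive larger weights.
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weights = dict(zip(classes, weights))
print(class_weights)
```

Many scikit-learn classifiers accept such a mapping through their `class_weight` parameter. Resampling the training set is an alternative approach.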


To make the text more suitable for a machine learning model, the text needs to be tokenized. This will be done using the **nltk** module.

In [None]:
import nltk
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import regexp_tokenize, word_tokenize, RegexpTokenizer
import matplotlib.pyplot as plt
import string
import re
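
As a rough illustration of the tokenization step, the sketch below lowercases a post, strips URLs and @mentions, and tokenizes it with nltk's `RegexpTokenizer`. The regular expressions, the `tokenize_post` helper, and the tiny stopword set are hypothetical placeholders. Dropping @mentions is also an assumption that may not suit every analysis.

```python
import re
from nltk.tokenize import RegexpTokenizer

# Hypothetical pattern: runs of letters, optionally followed by an apostrophe suffix.
tokenizer = RegexpTokenizer(r"[a-z]+(?:'[a-z]+)?")

# Small placeholder stopword list; nltk.corpus.stopwords could replace it
# once the corpus has been fetched with nltk.download("stopwords").
PLACEHOLDER_STOPWORDS = {"the", "is", "a", "my", "and", "to", "of"}

def tokenize_post(text):
    """Lowercase a post, strip URLs and @mentions, and return non-stopword tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop links
    text = re.sub(r"@\w+", " ", text)          # drop user mentions
    return [tok for tok in tokenizer.tokenize(text) if tok not in PLACEHOLDER_STOPWORDS]

tokens = tokenize_post("@Apple the new iPhone is great! http://t.co/abc123 #apple")
print(tokens)  # ['new', 'iphone', 'great', 'apple']
```

The same helper could be applied to the whole column with `df['text'].apply(tokenize_post)`.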

# Exploratory Data Analysis

# Conclusion

## Limitations

## Recommendations

## Next Steps