# Black Friday Tweets Sentiment Analysis Project

## Introduction
Hey there! Welcome to my Black Friday Tweets Sentiment Analysis project!
 
## Overview
In the era of digital communication, social media platforms like Twitter serve as invaluable sources of information and insights into public opinion. In this project, I'll be delving into the realm of sentiment analysis, a natural language processing technique aimed at discerning the emotional tone behind textual data. My focus? Black Friday tweets.

## Project Objectives
- **Sentiment Analysis:** Understand the sentiment (positive, negative, or neutral) expressed in Black Friday tweets.
- **Machine Learning Pipeline:** Develop a robust machine learning pipeline using Python and Jupyter Notebook to preprocess data, train a Logistic Regression model, and evaluate its accuracy.
- **Data Visualization:** Leverage cloud services like Amazon S3, Athena, and QuickSight for effective storage, organization, and visualization of sentiment analysis results.

## Why Black Friday Tweets?
Black Friday, a significant shopping event, generates a buzz on social media platforms. Analyzing the sentiment of related tweets can provide valuable insights into public perception, trends, and reactions.

## Technologies Used
- **Python and Jupyter Notebook:** For coding and interactive development.
- **Amazon S3:** Scalable storage for data and results.
- **Amazon Athena:** Enables interactive querying and analysis of data in Amazon S3.
- **Amazon QuickSight:** Business analytics tool for creating interactive visualizations.

## Project Workflow
1. **Data Collection:** Using the Twitter API to gather tweets related to Black Friday.
2. **Data Preprocessing:** Cleaning and transforming tweet data for effective analysis.
3. **Model Training:** Training a Logistic Regression model for sentiment analysis.
4. **Model Evaluation:** Assessing the accuracy of the trained model.
5. **Data Visualization:** Storing and visualizing sentiment analysis results using cloud services.

## Getting Started
To explore the fascinating world of Black Friday tweets and sentiment analysis, let's dive into the Jupyter Notebook sections. Each step is meticulously crafted to guide you through the process and insights derived from this analysis.

Let's embark on this journey into the sentiments expressed in the Black Friday Twitterverse!




## Importing the libraries

In [0]:
#pip install --upgrade pip

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Collecting pip
  Downloading pip-23.3.2-py3-none-any.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 10.6 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.2.2
    Uninstalling pip-22.2.2:
      Successfully uninstalled pip-22.2.2
Successfully installed pip-23.3.2


In [0]:
!pip install vaderSentiment

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 126.0/126.0 kB 1.5 MB/s eta 0:00:00
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [0]:
dbutils.library.restartPython()

In [0]:
# Pyspark SQL  
# Feature Transformation / Data Cleaning
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, FloatType, IntegerType
from pyspark.sql.functions import col, sum, udf
import pyspark.sql.functions as F
#Python Regex
import re

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Sentiment Analyzer
# !pip install vaderSentiment
# VADER
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer #vader sentiments

# Pyspark Machine Learning
from pyspark.ml.feature import CountVectorizer,NGram, VectorAssembler, StopWordsRemover, HashingTF, IDF, Tokenizer, StringIndexer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator

## Load The Sentiment Data
Mount the WeCloudData (WCD) public dataset bucket

In [0]:
def mount_s3_bucket(access_key, secret_key, bucket_name, mount_folder):
  # Escape "/" in the secret key so it can be embedded in the s3a URI
  encoded_secret_key = secret_key.replace("/", "%2F")
  mount_point = "/mnt/%s" % mount_folder

  print("Mounting", bucket_name)

  try:
    # Unmount the data in case it was already mounted
    dbutils.fs.unmount(mount_point)
  except Exception:
    # If it fails to unmount, it most likely wasn't mounted in the first place
    print("Directory not unmounted:", mount_folder)

  # Lastly, mount our bucket.
  dbutils.fs.mount("s3a://%s:%s@%s" % (access_key, encoded_secret_key, bucket_name), mount_point)
  print("The bucket", bucket_name, "was mounted to", mount_folder, "\n")
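
The function above escapes only `/` in the secret key. AWS secret keys can also contain other URI-reserved characters such as `+`, so a more general encoding with the standard library is sketched here. Whether the mount URI needs anything beyond `/` escaped is an assumption I have not verified on Databricks, and `encode_secret_key` is a hypothetical helper that is not part of the notebook.

```python
from urllib.parse import quote


def encode_secret_key(secret_key: str) -> str:
    """Percent-encode every URI-reserved character in a secret key (hypothetical helper)."""
    # safe="" makes quote() encode "/" as well, which it leaves untouched by default
    return quote(secret_key, safe="")


# A made-up value for illustration, not a real credential
print(encode_secret_key("abc/def+ghi"))  # abc%2Fdef%2Bghi
```

For keys that contain only letters, digits, and `/`, this gives the same result as the `replace("/", "%2F")` call in the mount function.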
    

In [0]:
# Set AWS programmatic access credentials

In [0]:
# Mount the dataset
mount_s3_bucket(ACCESS_KEY, SECRET_ACCESS_KEY, "weclouddata/twitter/", "data")

Mounting weclouddata/twitter/
/mnt/data has been unmounted.
The bucket weclouddata/twitter/ was mounted to data 



In [0]:
# Explore the mounted folder
# %fs ls /mnt/data/

# If the %fs line magic above doesn't work, use dbutils.fs.ls('<path to the file/folder>') instead

dbutils.fs.ls("/mnt/data")

[FileInfo(path='dbfs:/mnt/data/AI/', name='AI/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/mnt/data/BlackFriday/', name='BlackFriday/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/mnt/data/CSIS/', name='CSIS/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/mnt/data/Do Not Use/', name='Do Not Use/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/mnt/data/ElonMusk/', name='ElonMusk/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/mnt/data/Inflation/', name='Inflation/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/mnt/data/Iran/', name='Iran/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/mnt/data/MTA/', name='MTA/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/mnt/data/WorldCup/', name='WorldCup/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/mnt/data/cancer/', name='cancer/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/mnt/data/thanksgiving/', name='thanksgiving/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/mnt/data/t

In [0]:
# display the mounted data  folder
files = dbutils.fs.ls("/mnt/data")
display(files)

path,name,size,modificationTime
dbfs:/mnt/data/AI/,AI/,0,0
dbfs:/mnt/data/BlackFriday/,BlackFriday/,0,0
dbfs:/mnt/data/CSIS/,CSIS/,0,0
dbfs:/mnt/data/Do Not Use/,Do Not Use/,0,0
dbfs:/mnt/data/ElonMusk/,ElonMusk/,0,0
dbfs:/mnt/data/Inflation/,Inflation/,0,0
dbfs:/mnt/data/Iran/,Iran/,0,0
dbfs:/mnt/data/MTA/,MTA/,0,0
dbfs:/mnt/data/WorldCup/,WorldCup/,0,0
dbfs:/mnt/data/cancer/,cancer/,0,0


In [0]:
%fs ls /mnt/data/BlackFriday/2022/

path,name,size,modificationTime
dbfs:/mnt/data/BlackFriday/2022/11/,11/,0,0
dbfs:/mnt/data/BlackFriday/2022/12/,12/,0,0


In [0]:
# I am using Black Friday Data
# %fs ls /mnt/data

# data/topic/year/month/day/hour/files
# filePath = '/mnt/data/BlackFriday/*/*/*/*/*' # wildcard filtering for selecting all files 
# filePath

# file path for Black Friday tweets collected on December 5th, 2022
# Due to limited computing power, we use tweets from a single day only (2022/12/05).
filePath = '/mnt/data/BlackFriday/2022/12/05/*/*'
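
The bucket layout noted in the comments (`data/topic/year/month/day/hour/files`) makes it easy to build the path for any single day instead of typing it by hand. A minimal sketch, assuming every topic follows that same layout; `build_day_path` is a hypothetical helper that is not part of the notebook.

```python
from datetime import date


def build_day_path(topic: str, day: date, root: str = "/mnt/data") -> str:
    """Return a glob matching every hourly file for one topic on one day.

    Assumes the layout root/topic/year/month/day/hour/files.
    """
    return f"{root}/{topic}/{day:%Y/%m/%d}/*/*"


example_path = build_day_path("BlackFriday", date(2022, 12, 5))
print(example_path)  # /mnt/data/BlackFriday/2022/12/05/*/*
```

Passing a different `date` selects another day without touching the rest of the notebook.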

## Create Spark Session

In [0]:
spark = (SparkSession
        .builder
        .appName('Twitter Sentiment Analysis')
        .getOrCreate())

print('Session created')

Session created


In [0]:
sc = spark.sparkContext

In [0]:
# Define schema
schema = StructType([
    StructField('id', StringType(), True),
    StructField('name', StringType(), True),
    StructField('username', StringType(), True),
    StructField('tweet', StringType(), True),
    StructField('followers_count', StringType(), True),
    StructField('location', StringType(), True),
    StructField('geo', StringType(), True),
    StructField('created_at', StringType(), True)
])
     


In [0]:
# read data from the selected file path
df_bf = (spark.read.schema(schema).option('delimiter','\t').csv(filePath))

In [0]:
# cache the dataframe for faster iteration
df_bf.cache()

# run the count action to materialize the cache and speed up the read process
df_bf.count()

20718

In [0]:
df_bf.show()

+-------------------+--------------------+---------------+--------------------+---------------+----------------+----+--------------------+
|                 id|                name|       username|               tweet|followers_count|        location| geo|          created_at|
+-------------------+--------------------+---------------+--------------------+---------------+----------------+----+--------------------+
|1599824098993266688|E L • P A P I 🔥?...|   beto68290871|RT @Alex_boy_1: ?...|         168899| Yucatán, México|None|Mon Dec 05 17:52:...|
|1599824132052439040|                 Max|    Max40510425|RT @AvaKoxxx: ♠️ ...|             44|            None|None|Mon Dec 05 17:52:...|
|1599824144861839362|      Beautiful Love| lov3_b3autiful|RT @lunaseduces: ...|            335|            None|None|Mon Dec 05 17:52:...|
|1599824166936797185|          seungwoo94|    seungwoo941|RT @preorderwithp...|              1|            None|None|Mon Dec 05 17:52:...|
|1599824169231081472|       

## Exploratory Data Analysis (EDA) & Data Cleaning


### Text Cleaning Preprocessing

### Dropping rows in Tweet column with null/na values



`pyspark.sql.functions.regexp_replace`, together with `lower` and `trim`, is used to process the text:

1. Remove URLs such as `http://cnn.com`
2. Replace every character that is not a letter (digits, punctuation, emoji) with a space
3. Replace multiple spaces with a single space
4. Lowercase all text
5. Trim leading/trailing whitespace
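
To see what the five steps above do to a single tweet, here is a minimal sketch that applies the same regular expressions with Python's built-in `re` module instead of Spark. `clean_tweet` is a hypothetical helper for illustration; the notebook itself does this work on the DataFrame column.

```python
import re


def clean_tweet(text: str) -> str:
    """Apply the five cleaning steps to one tweet (illustrative helper)."""
    text = re.sub(r"http\S+", "", text)     # 1. remove URLs
    text = re.sub(r"[^a-zA-Z]", " ", text)  # 2. replace non-letters with a space
    text = re.sub(r"\s+", " ", text)        # 3. collapse repeated whitespace
    return text.lower().strip()             # 4. lowercase, 5. trim


sample = "RT @preorderwithpj: LE SPECS BLACK FRIDAY SALE🔖 https://t.co/rJqxhB9PIp"
print(clean_tweet(sample))  # rt preorderwithpj le specs black friday sale
```

Note that step 2 also strips digits, so a phrase like "50% OFF" becomes just "off".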

In [0]:
# Get the shape of the DataFrame
num_rows = df_bf.count()
num_cols = len(df_bf.columns)

print("Number of rows:", num_rows)
print("Number of columns:", num_cols)

Number of rows: 20718
Number of columns: 8


In [0]:
# check for null values for column 'tweet'

# Count the number of rows with null values in the "tweet" column
null_tweet_count = df_bf.select(sum(col("tweet").isNull().cast("int"))).collect()[0][0]

print(" The number of rows with null values in the 'tweet' column:", null_tweet_count)

 The number of rows with null values in the 'tweet' column: 52


In [0]:
# Drop rows with null values in the "tweet" column
df_bf_cleaned = df_bf.na.drop(subset=["tweet"])

# Check the number of rows after dropping null values
remaining_rows = df_bf_cleaned.count()
print("Number of remaining rows after dropping null values:", remaining_rows)

Number of remaining rows after dropping null values: 20666


In [0]:
display(df_bf_cleaned.take(10))

id,name,username,tweet,followers_count,location,geo,created_at
1599824098993266688,E L • P A P I 🔥🇲🇽★,beto68290871,RT @Alex_boy_1: 😱omg!!! experience with flamingo app Control smart vibrator 😇 👅💦Wow very powerful toy Feeling amazing..... Must try once G…,168899,"Yucatán, México",,Mon Dec 05 17:52:43 +0000 2022
1599824132052439040,Max,Max40510425,RT @AvaKoxxx: ♠️ BLACK FRIDAY ♠️ . BRAND NEW EXCLUSIVE SEX TAPE ♠️ . FREE TO SUBSCRIBE . https://t.co/490yXY9GDY https://t.co/U1lvbO52QM,44,,,Mon Dec 05 17:52:51 +0000 2022
1599824144861839362,Beautiful Love,lov3_b3autiful,RT @lunaseduces: BLACK FRIDAY SALE! My onlyfans is 50% OFF until Monday! Link in comments ⬇️ https://t.co/tmoNBv8JDX,335,,,Mon Dec 05 17:52:54 +0000 2022
1599824166936797185,seungwoo94,seungwoo941,RT @preorderwithpj: LE SPECS BLACK FRIDAY SALE🔖 https://t.co/rJqxhB9PIp,1,,,Mon Dec 05 17:52:59 +0000 2022
1599824169231081472,Guy shows!,sexypenis16,"RT @LouiseMoorexo95: Black Friday sale ,come join me 💕😈 https://t.co/FSCBC6ak5e https://t.co/H8gCnOsxr5",325,,,Mon Dec 05 17:52:59 +0000 2022
1599824177824866305,Silver Star,SilverS25497353,"RT @itsLexi91: HAPPY THANKSGIVING!!! Black Friday Special, Half Off All my content!! 😈 #nsfwtwt https://t.co/zoUGjsjFgd",57,,,Mon Dec 05 17:53:01 +0000 2022
1599824215904964608,I FCK FANS 💦 $5 Onlyfans,hoodrichesha,"RT @ale_only_25: Jumping 💦 Black Friday on my NO PPV, 15$ https://t.co/4sj0SxESfV",223905,"Florida, USA",,Mon Dec 05 17:53:11 +0000 2022
1599824227405905924,David Gallows,DavidGallows,Hi @westerndigital I tried to buy one of your external drives during the Black Friday sale. DHL failed to deliver… https://t.co/2BvoEoBcvx,226,,,Mon Dec 05 17:53:13 +0000 2022
1599824249509715968,Stephen Vinson,whoatemyblog,I bought a go pro on Black Friday cause I’m nuts. Gonna get out of my comfort zone and vlog. Good thing the camera… https://t.co/TpCtUGnw5Y,2554,"Birmingham, AL",,Mon Dec 05 17:53:19 +0000 2022
1599824293344382976,E L • P A P I 🔥🇲🇽★,beto68290871,RT @Alex_boy_1: 😱omg!!! experience with flamingo app Control smart vibrator 😇 👅💦Wow very powerful toy Feeling amazing..... Must try once G…,168903,"Yucatán, México",,Mon Dec 05 17:53:29 +0000 2022


In [0]:
# Clean the tweet text step by step:
#   1. r"http\S+"   -> ""  : delete every URL
#   2. r"[^a-zA-Z]" -> " " : replace anything that is not a letter (digits, punctuation, emoji) with a space
#   3. r"\s+"       -> " " : replace multiple spaces with a single space
#   4. F.lower             : lowercase the tweet
#   5. F.trim              : trim leading/trailing whitespace
df_clean = (
    df_bf_cleaned
    .withColumn('tweet', F.regexp_replace('tweet', r"http\S+", ""))
    .withColumn('tweet', F.regexp_replace('tweet', r"[^a-zA-Z]", " "))
    .withColumn('tweet', F.regexp_replace('tweet', r"\s+", " "))
    .withColumn('tweet', F.lower('tweet'))
    .withColumn('tweet', F.trim('tweet'))
)
display(df_clean)

id,name,username,tweet,followers_count,location,geo,created_at
1599824098993266688,E L • P A P I 🔥🇲🇽★,beto68290871,rt alex boy omg experience with flamingo app control smart vibrator wow very powerful toy feeling amazing must try once g,168899.0,"Yucatán, México",,Mon Dec 05 17:52:43 +0000 2022
1599824132052439040,Max,Max40510425,rt avakoxxx black friday brand new exclusive sex tape free to subscribe,44.0,,,Mon Dec 05 17:52:51 +0000 2022
1599824144861839362,Beautiful Love,lov3_b3autiful,rt lunaseduces black friday sale my onlyfans is off until monday link in comments,335.0,,,Mon Dec 05 17:52:54 +0000 2022
1599824166936797185,seungwoo94,seungwoo941,rt preorderwithpj le specs black friday sale,1.0,,,Mon Dec 05 17:52:59 +0000 2022
1599824169231081472,Guy shows!,sexypenis16,rt louisemoorexo black friday sale come join me,325.0,,,Mon Dec 05 17:52:59 +0000 2022
1599824177824866305,Silver Star,SilverS25497353,rt itslexi happy thanksgiving black friday special half off all my content nsfwtwt,57.0,,,Mon Dec 05 17:53:01 +0000 2022
1599824215904964608,I FCK FANS 💦 $5 Onlyfans,hoodrichesha,rt ale only jumping black friday on my no ppv,223905.0,"Florida, USA",,Mon Dec 05 17:53:11 +0000 2022
1599824227405905924,David Gallows,DavidGallows,hi westerndigital i tried to buy one of your external drives during the black friday sale dhl failed to deliver,226.0,,,Mon Dec 05 17:53:13 +0000 2022
1599824249509715968,Stephen Vinson,whoatemyblog,i bought a go pro on black friday cause i m nuts gonna get out of my comfort zone and vlog good thing the camera,2554.0,"Birmingham, AL",,Mon Dec 05 17:53:19 +0000 2022
1599824293344382976,E L • P A P I 🔥🇲🇽★,beto68290871,rt alex boy omg experience with flamingo app control smart vibrator wow very powerful toy feeling amazing must try once g,168903.0,"Yucatán, México",,Mon Dec 05 17:53:29 +0000 2022


In [0]:
df_clean.columns

['id',
 'name',
 'username',
 'tweet',
 'followers_count',
 'location',
 'geo',
 'created_at']

##### Create sentiment labels using VADER (other libraries, such as TextBlob, can also label tweets as positive or negative)

### Generate Lexicon-/Rule-based Sentiments (auto-label) with VADER
#### VADER = Valence Aware Dictionary and sEntiment Reasoner. It is a package that generates polarity scores for unlabeled text and is well suited to (built for) social media data.

- It is lexicon- and rule-based, meaning it has pre-assigned sentiment scores for words and groups of words
- It can interpret capitalization, emoticons, conjunctions, punctuation (!!!), negations (e.g. NOT good), slang (kinda kewl), and acronyms (omg rotfl)
- The lexicon was rated by human annotators, and their ratings were aggregated statistically
- More info here: [piocalderon/vader](https://medium.com/@piocalderon/vader-sentiment-analysis-explained-f1c4f9101cd9)
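
VADER's compound score ranges from -1 to 1, and the VADER documentation suggests treating scores of 0.05 or more as positive, -0.05 or less as negative, and anything in between as neutral. A minimal sketch of that three-way rule follows; `label_from_compound` is a hypothetical helper, and this notebook itself uses a simpler binary rule. The first three scores are taken from the notebook output, and the last one is made up for illustration.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score in [-1, 1] to a three-way sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# 0.9344, 0.5859 and 0.0 appear in the notebook output; -0.5 is a made-up value
for score in (0.9344, 0.5859, 0.0, -0.5):
    print(score, label_from_compound(score))
```

With this rule a score of exactly 0.0 is labeled neutral rather than positive. One caveat worth keeping in mind: the tweets are lowercased and stripped of punctuation and emoji before scoring, so VADER cannot use those cues here.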

In [0]:
# define a function to get the sentiment score using VADER
def getSentimentScore(tweetText):
    sia = SentimentIntensityAnalyzer()  # initialize the sentiment analyzer
    ss = sia.polarity_scores(tweetText)  # apply the analyzer to the input text
    # return the compound score: the summed word valences, normalized to the range -1 to 1
    return float(ss['compound'])

# define a function to turn the score into a binary label
# note: a neutral score of exactly 0 is labeled positive (1)
def getSentiment(score):
    return 1 if score >= 0 else 0
     

In [0]:
# create sentiment score column
udfss=udf(getSentimentScore, FloatType())
df_clean = df_clean.withColumn('sentiment score',udfss('tweet'))

In [0]:
# Create sentiment column - positive: 1, negative: 0
udf_sentiment = udf(getSentiment, IntegerType())
df_clean_with_sentiment = df_clean.withColumn('sentiment', udf_sentiment('sentiment score'))

In [0]:
# df_clean_with_sentiment is the dataframe with the original Black Friday data + sentiment score + sentiment
# show() prints the rows itself and returns None, so it does not need to be wrapped in display()
df_clean_with_sentiment.show(5)


+-------------------+--------------------+--------------+--------------------+---------------+---------------+----+--------------------+---------------+---------+
|                 id|                name|      username|               tweet|followers_count|       location| geo|          created_at|sentiment score|sentiment|
+-------------------+--------------------+--------------+--------------------+---------------+---------------+----+--------------------+---------------+---------+
|1599824098993266688|E L • P A P I 🔥?...|  beto68290871|rt alex boy omg e...|         168899|Yucatán, México|None|Mon Dec 05 17:52:...|         0.9344|        1|
|1599824132052439040|                 Max|   Max40510425|rt avakoxxx black...|             44|           None|None|Mon Dec 05 17:52:...|         0.5859|        1|
|1599824144861839362|      Beautiful Love|lov3_b3autiful|rt lunaseduces bl...|            335|           None|None|Mon Dec 05 17:52:...|            0.0|        1|
|1599824166936797185|  

In [0]:
# select the sentiment and tweet columns for the purpose of this sentiment analysis
final_df = df_clean_with_sentiment.select('sentiment', 'tweet')
final_df.show(10)


+---------+--------------------+
|sentiment|               tweet|
+---------+--------------------+
|        1|rt alex boy omg e...|
|        1|rt avakoxxx black...|
|        1|rt lunaseduces bl...|
|        1|rt preorderwithpj...|
|        1|rt louisemoorexo ...|
|        1|rt itslexi happy ...|
|        0|rt ale only jumpi...|
|        0|hi westerndigital...|
|        1|i bought a go pro...|
|        1|rt alex boy omg e...|
+---------+--------------------+
only showing top 10 rows



In [0]:
display(final_df)

sentiment,tweet
1,rt alex boy omg experience with flamingo app control smart vibrator wow very powerful toy feeling amazing must try once g
1,rt avakoxxx black friday brand new exclusive sex tape free to subscribe
1,rt lunaseduces black friday sale my onlyfans is off until monday link in comments
1,rt preorderwithpj le specs black friday sale
1,rt louisemoorexo black friday sale come join me
1,rt itslexi happy thanksgiving black friday special half off all my content nsfwtwt
0,rt ale only jumping black friday on my no ppv
0,hi westerndigital i tried to buy one of your external drives during the black friday sale dhl failed to deliver
1,i bought a go pro on black friday cause i m nuts gonna get out of my comfort zone and vlog good thing the camera
1,rt alex boy omg experience with flamingo app control smart vibrator wow very powerful toy feeling amazing must try once g


### Feature Transformer: Tokenizer
#### Tokenizer lowercases each tweet (a string) and splits it on whitespace into a list of words

In [0]:

# Tokenize the tweets
tokenizer = Tokenizer(inputCol="tweet", outputCol="tokens") 
final_df_tokenized = tokenizer.transform(final_df)

display(final_df_tokenized.head(5))

sentiment,tweet,tokens
1,rt alex boy omg experience with flamingo app control smart vibrator wow very powerful toy feeling amazing must try once g,"List(rt, alex, boy, omg, experience, with, flamingo, app, control, smart, vibrator, wow, very, powerful, toy, feeling, amazing, must, try, once, g)"
1,rt avakoxxx black friday brand new exclusive sex tape free to subscribe,"List(rt, avakoxxx, black, friday, brand, new, exclusive, sex, tape, free, to, subscribe)"
1,rt lunaseduces black friday sale my onlyfans is off until monday link in comments,"List(rt, lunaseduces, black, friday, sale, my, onlyfans, is, off, until, monday, link, in, comments)"
1,rt preorderwithpj le specs black friday sale,"List(rt, preorderwithpj, le, specs, black, friday, sale)"
1,rt louisemoorexo black friday sale come join me,"List(rt, louisemoorexo, black, friday, sale, come, join, me)"


### Feature Transformer: Stopword Removal
##### Stopwords (such as "the", "is", "to") make tweets grammatical but carry little semantic meaning, so I remove them.

In [0]:
# Remove stopword
stopword_remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
final_df_stopword = stopword_remover.transform(final_df_tokenized)

#display(final_df_stopword.head(5))
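
To make the two transformers above concrete, here is a minimal pure-Python sketch of whitespace tokenization followed by stopword filtering. The stopword set is a tiny illustrative subset that I chose for this example (Spark's `StopWordsRemover` ships with a much longer default English list), and `tokenize` and `remove_stopwords` are hypothetical helpers.

```python
# Tiny illustrative subset, NOT Spark's default stopword list
STOPWORDS = {"i", "to", "of", "your", "the", "during"}


def tokenize(text: str) -> list:
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def remove_stopwords(tokens: list, stopwords: set = STOPWORDS) -> list:
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]


tweet = ("hi westerndigital i tried to buy one of your external drives "
         "during the black friday sale dhl failed to deliver")
print(remove_stopwords(tokenize(tweet)))
```

The words that survive ("tried", "buy", "failed", "deliver", and so on) are the ones that carry the tweet's meaning.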

### Feature Transformer: CountVectorizer (TF - Term Frequency)

#### Counting how often each word occurs in each tweet


In [0]:

# Apply count vectorizer
cv = CountVectorizer(vocabSize=2**16, inputCol="filtered", outputCol='cv')
cv_model = cv.fit(final_df_stopword)
final_df_cv = cv_model.transform(final_df_stopword)

#display(final_df_cv.show(5))

### Feature Transformer: TF-IDF Vectorization
#### Weighting each word's count by how rare the word is across all tweets, so words that appear everywhere count for less

In [0]:
# TF-IDF Vectorization
idf = IDF(inputCol='cv', outputCol="features", minDocFreq=5) 
idf_model = idf.fit(final_df_cv)
final_df_idf = idf_model.transform(final_df_cv)

#display(final_df_idf.head(5)) # The dataframe is now ready for the following machine learning stage.
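
The arithmetic behind the TF-IDF step can be sketched in pure Python. Spark's `IDF` documents its weight as log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term, and it sets the weight to 0 for terms that appear in fewer than `minDocFreq` documents. The three toy "tweets" below are made up for illustration, and `idf_weights` and `tfidf` are hypothetical helpers.

```python
import math
from collections import Counter

# Made-up toy corpus: each inner list is one tokenized tweet
docs = [
    ["black", "friday", "sale"],
    ["black", "friday", "deal"],
    ["happy", "thanksgiving"],
]


def idf_weights(docs, min_doc_freq=0):
    """IDF per term: log((m + 1) / (df + 1)), or 0.0 below min_doc_freq."""
    m = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {
        term: math.log((m + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for term, df in doc_freq.items()
    }


def tfidf(doc, idf):
    """Multiply each term's raw count in one document by its IDF weight."""
    return {term: count * idf[term] for term, count in Counter(doc).items()}


idf = idf_weights(docs)
print(tfidf(docs[0], idf))
```

Here "sale" (one tweet) gets a higher weight than "black" (two tweets), which is the intended effect: rarer words are treated as more informative.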

#### A label encoder (`StringIndexer`) is needed when the sentiment column holds strings such as "positive" or "negative". Since the column is already numeric here, I skip that step.

### Model Training: Logistic Regression

In [0]:

# rename column 'sentiment' to 'label'
final_df_idf = final_df_idf.withColumnRenamed("sentiment", "label")

In [0]:
# split the data into training and test sets
train_data, test_data = final_df_idf.randomSplit([0.7, 0.3], seed=1234)

lr = LogisticRegression(maxIter=100)

lr_model = lr.fit(train_data)

predictions = lr_model.transform(test_data)

#display(predictions.head(5))

### Model Evaluation 

In [0]:

# Evaluate the model with the binary classification evaluator (default metric: area under the ROC curve)
# For three classes (positive, neutral, negative), MulticlassClassificationEvaluator would be used instead
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction") 
roc_auc = evaluator.evaluate(predictions)
accuracy = predictions.filter(predictions.label == predictions.prediction).count() / float(predictions.count())

print("Accuracy Score: {0:.4f}".format(accuracy))
print("ROC-AUC: {0:.4f}".format(roc_auc))
     

Accuracy Score: 0.9462
ROC-AUC: 0.9026
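
Accuracy on its own can be hard to interpret when one class dominates, which is plausible here because every tweet with a compound score of 0 or more is labeled 1. A minimal sketch of comparing accuracy against a majority-class baseline follows. The labels and predictions are made-up toy values, not results from this notebook, and `accuracy` and `majority_baseline` are hypothetical helpers.

```python
from collections import Counter


def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)


def majority_baseline(labels):
    """Accuracy of always predicting the most common label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


# Made-up toy data: 8 positive and 2 negative labels
toy_labels = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
toy_predictions = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

print("accuracy:", accuracy(toy_labels, toy_predictions))    # 0.9
print("majority baseline:", majority_baseline(toy_labels))   # 0.8
```

On the real data, the class counts needed for the baseline could be obtained with `predictions.groupBy('label').count()`.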


In [0]:
# # To save the prediction 
# # Mount your bucket (your mount is called my_bucket)
# prediction.write.mode('overwrite').csv('mnt/ptb2-olusegun2/prediction.csv')

### Save the data and the predictions into my bucket

In [0]:
# mount_s3_bucket(access_key, secret_key, bucket_name, mount_folder)

In [0]:
# mount my S3 bucket so the Black Friday data and predictions can be saved to it
mount_s3_bucket(ACCESS_KEY, SECRET_ACCESS_KEY, 'ptb2-olusegun2', 'my_bucket')

Mounting ptb2-olusegun2
/mnt/my_bucket has been unmounted.
The bucket ptb2-olusegun2 was mounted to my_bucket 



In [0]:
# Saving the data as a csv in the s3 bucket 
# remove header for athena
df_clean.write.option('header','false').csv('/mnt/my_bucket/demo/data.csv')

         

In [0]:

# save the predictions as a Parquet file
predictions.write.parquet('/mnt/my_bucket/demo/predictions.parquet')

## Conclusion

#### This project is focused on analyzing Twitter data during Black Friday using machine learning techniques, specifically sentiment analysis. The aim is to gain insights into consumer behavior and preferences during this significant shopping event.

#### The project follows a step-by-step approach: mounting the tweet data from the WeCloudData public dataset bucket, creating a Spark session and Spark DataFrame, cleaning the text, creating a sentiment column with VADER, performing feature transformation, and training and evaluating the model. Finally, the cleaned data and predictions are saved to my own S3 bucket.

#### This project demonstrates the power of machine learning in analyzing unstructured data from social media platforms like Twitter to provide valuable insights into consumer behavior, which can help businesses and policymakers make informed decisions.