# Semantic Analysis on YouTube Comments

>Ming Zhao <br>
>Jan 18, 2022

The dataset contains comments on videos related to animals and/or pets. The file is comma-separated, with a header line defining the field names:

1. creator_name: name of the YouTube channel creator.
2. userid: integer identifier for the users commenting on the YouTube channels.
3. comment: text of the comments made by the users.

**Step 1**: Exploring and Preprocessing Data
* Loading and Exploring Data. 
* Labeling data: Identify Cat/Dog Owners.
* Tokenizing Text: Split text into smaller units.
* Removing Stopwords: Exclude words without significant meaning.
* Splitting Data: Divide dataset into training and testing.

**Step 2**: Building and Evaluating Classifiers
* Build classifiers for the cat/dog owners. <br>
    -- Logistic Regression <br>
    -- Random Forest <br>
    -- Gradient-Boosted Trees <br>
* Measure the performance of the classifiers.

**Step 3**: Classifying All Users
* Apply the cat/dog classifiers to all the users in the dataset. 
* Estimate the fraction of all users who are cat/dog owners.

**Step 4**: Extracting Insights about Cat/Dog Owners
* Find topics important to cat/dog owners.

**Step 5**: Identifying Creators with Cat/Dog Owners in the Audience 
* Find creators with the most comments by cat/dog owners.


In [1]:
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter("ignore")

In [2]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession \
        .builder \
        .appName("Python Spark SQL") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()

22/09/02 04:40:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
from pyspark.sql.functions import when, col
from pyspark.sql.functions import regexp_replace
from pyspark.sql.functions import udf
from pyspark.sql.types import *

In [4]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer 
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import Word2Vec
from pyspark.ml.feature import CountVectorizer
from nltk.stem.snowball import SnowballStemmer

In [5]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.clustering import LDA

## 1. Data Exploration and Preprocessing

### Data Loading

In [6]:
df_raw=spark.read.csv("animals_comments.csv",inferSchema=True,header=True)

                                                                                

### Data Exploration

In [7]:
df_raw.show(5)

+-------------+------+--------------------+
| creator_name|userid|             comment|
+-------------+------+--------------------+
| Doug The Pug|  87.0|I shared this to ...|
| Doug The Pug|  87.0|  Super cute  😀🐕🐶|
|  bulletproof| 530.0|stop saying get e...|
|Meu Zoológico| 670.0|Tenho uma jiboia ...|
|       ojatro|1031.0|I wanna see what ...|
+-------------+------+--------------------+
only showing top 5 rows



In [29]:
df_raw.count()

                                                                                

5786944

In [9]:
df_raw.dtypes

[('creator_name', 'string'), ('userid', 'double'), ('comment', 'string')]

In [10]:
# count missing values
for column in df_raw.columns:
    print(df_raw.filter(df_raw[column].isNull()).count())

# drop missing values    
for column in df_raw.columns:
    df_raw=df_raw.filter(df_raw[column].isNotNull())

                                                                                

32050


                                                                                

565


                                                                                

1051


### Data Labeling

In [11]:
df_raw=df_raw.withColumn("label", \
        (when(col("comment").like("%my dog%"),1).when(col("comment").like("%my dogs%"),1) \
        .when(col("comment").like("%my puppy%"),1).when(col("comment").like("%my puppies%"),1) \
        .when(col("comment").like("%my pup%"),1).when(col("comment").like("%my pups%"),1) \
        .when(col("comment").like("%i have a dog%"),1).when(col("comment").like("%i have dogs%"),1) \
        .when(col("comment").like("%i have a puppy%"),1).when(col("comment").like("%i have puppies%"),1) \
        .when(col("comment").like("%i have a pup%"),1).when(col("comment").like("%i have pups%"),1) \
        .when(col("comment").like("%my cat%"),1).when(col("comment").like("%my cats%"),1) \
        .when(col("comment").like("%my kitty%"),1).when(col("comment").like("%my kitties%"),1) \
        .when(col("comment").like("%my kitten%"),1).when(col("comment").like("%my kittens%"),1) \
        .when(col("comment").like("%i have a cat%"),1).when(col("comment").like("%i have cats%"),1) \
        .when(col("comment").like("%i have a kitty%"),1).when(col("comment").like("%i have kitties%"),1) \
        .when(col("comment").like("%i have a kitten%"),1).when(col("comment").like("%i have kittens%"),1) \
        .when(col("comment").like("%my baby dog%"),1).when(col("comment").like("%my baby dogs%"),1) \
        .when(col("comment").like("%my baby cat%"),1).when(col("comment").like("%my baby cats%"),1) \
        .when(col("comment").like("%my little dog%"),1).when(col("comment").like("%my little cat%"),1) \
        .otherwise(0)))
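
Spark's `like` is case-sensitive, so lowercase patterns such as `%i have a dog%` do not match comments written with a capital "I". The plain-Python sketch below shows one case-insensitive alternative. The regular expression and the helper name `label_comment` are assumptions made for illustration and are not part of the original notebook. The word boundaries also make it slightly stricter than the substring patterns above (for example, "my category" no longer counts as "my cat").

```python
import re

# Pet nouns covered by the LIKE patterns above (singular and plural forms)
PETS = r"(?:dogs?|pupp(?:y|ies)|pups?|cats?|kitt(?:y|ies)|kittens?)"

# "my [baby|little] <pet>" or "i have [a] <pet>", matched case-insensitively
OWNER_RE = re.compile(
    rf"\b(?:my(?: baby| little)?|i have(?: a)?) {PETS}\b",
    re.IGNORECASE,
)

def label_comment(text):
    """Return 1 if the comment contains an ownership phrase, else 0."""
    return 1 if OWNER_RE.search(text) else 0

print(label_comment("I have a dog named Max"))
print(label_comment("Super cute"))
```

In Spark the same idea could be applied by lowercasing the comment column before matching; that variant is not shown here because it would change the label counts reported below.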

In [12]:
df_raw.groupBy("label").count().show()



+-----+-------+
|label|  count|
+-----+-------+
|    1|  40215|
|    0|5746729|
+-----+-------+



                                                                                

In [18]:
df_raw.select("comment").filter(df_raw["label"]==1).show(5, False)

+---------------------------------------------------------------------------------------------------------+
|comment                                                                                                  |
+---------------------------------------------------------------------------------------------------------+
|Now I want to try that with my dog!!!                                                                    |
|I blow smoke in my cats ear right to his brain                                                           |
|my dog lucky wont eat of his bowl hell only eat out peoples hands how do i get him to eat out of his bowl|
|thats what my dog do                                                                                     |
|Im so happy i think Im almost crying Im hugging my dog Ik its not a cat but its a animal that need love  |
+---------------------------------------------------------------------------------------------------------+
only showing top 5 rows



### Data Selection

In [19]:
# Select all dog/cat owners
label1_data = df_raw.filter(df_raw["label"]==1) 
#label1_data= df_raw.filter(col("label")==1) 
    
# Select part of non-dog/cat-owners
label0_data, label0_rest = df_raw.filter(df_raw["label"]==0).randomSplit([0.05,0.95],seed=1991)

# Combine selected observations
all_data = label1_data.union(label0_data)

In [20]:
# Keep a 10% subsample so that the following steps run faster
data, rest = all_data.randomSplit([0.1,0.9],seed=1991)

In [21]:
data_count = data.count()
owner_count = data.filter(data["label"]==1).count()
print("Dataset Count: " + str(data_count))
print("Proportion of Dog/Cat Owners in Dataset: " + str(round(owner_count/data_count*100,2)) + "%")



Dataset Count: 32764
Proportion of Dog/Cat Owners in Dataset: 12.13%
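
As a sanity check, the size and class balance of this subsample can be estimated from the label counts and the two split ratios. The arithmetic below is a rough sketch: `randomSplit` is random, so the observed numbers only need to be close to these expected values.

```python
# Label counts reported by the groupBy("label") cell above
owners, non_owners = 40215, 5746729

kept_non_owners = non_owners * 0.05   # first randomSplit keeps about 5% of label 0
combined = owners + kept_non_owners   # all owners plus the sampled non-owners
expected_rows = combined * 0.1        # second randomSplit keeps about 10% of the union
expected_share = owners / combined    # the 10% split does not change the expected owner share

print("Expected rows:", round(expected_rows))
print("Expected owner share: {}%".format(round(expected_share * 100, 2)))
```

Both expected values are close to the observed dataset count and owner proportion printed above.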




### Pipeline
* Tokenizing Texts
* Removing Stop-Words
* Computing Distributed Representations of Words (Word2Vec)

In [22]:
# Remove numbers
data = data.withColumn('comment', regexp_replace(data.comment, '[0-9]', ''))

In [23]:
# Tokenize, remove stop words, and embed each comment as a Word2Vec document vector
regexTokenizer = RegexTokenizer(inputCol="comment", outputCol="words", pattern="\\W")  # split on non-word characters, which drops punctuation and emojis
remover = StopWordsRemover(inputCol="words", outputCol="words_clean")
word2Vec = Word2Vec(inputCol="words_clean", outputCol="features")
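
To make the first two stages concrete, here is a plain-Python sketch of what they do to a single comment. `RegexTokenizer` lowercases its input by default and treats the pattern as a delimiter. The `re.ASCII` flag is an assumption meant to mimic Java's ASCII-only `\W`, and `STOPWORDS` is a tiny hand-picked subset for illustration, not Spark's actual default list.

```python
import re

# Hypothetical, deliberately tiny stop-word list (Spark's default list is much longer)
STOPWORDS = {"i", "my", "to", "that", "with", "the", "a"}

def tokenize(text):
    """Lowercase, split on every non-word character, and drop empty tokens."""
    return [t for t in re.split(r"\W", text.lower(), flags=re.ASCII) if t]

def remove_stopwords(tokens):
    """Keep only tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = tokenize("Now I want to try that with my dog!!!")
print(tokens)
print(remove_stopwords(tokens))
```

One consequence of an ASCII-only `\W` is that accented letters act as delimiters, so non-English comments may be split in the middle of words.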

In [24]:
# Build a pipeline
pipeline = Pipeline(stages=[regexTokenizer, remover, word2Vec])

In [25]:
# Fit the pipeline on data
pipeline_fit = pipeline.fit(data)

22/09/02 04:48:56 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
22/09/02 04:48:56 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.ForeignLinkerBLAS
                                                                                

In [26]:
# Apply the fitted pipeline to add the words, words_clean, and features columns
df = pipeline_fit.transform(data)

In [21]:
df.select("label", "words", "words_clean", "features").show(5)

+-----+--------------------+--------------------+--------------------+
|label|               words|         words_clean|            features|
+-----+--------------------+--------------------+--------------------+
|    1|[actually, i, use...|[actually, use, b...|[0.01119084873353...|
|    1|[creo, que, el, s...|[creo, que, el, s...|[-0.3080230177276...|
|    1|[wish, my, cat, d...|         [wish, cat]|[0.00823379866778...|
|    1|[when, my, cat, n...|[cat, needs, food...|[0.02935355461456...|
|    1|[taj, at, looks, ...|[taj, looks, like...|[0.00210698095615...|
+-----+--------------------+--------------------+--------------------+
only showing top 5 rows
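
The `features` column holds one vector per comment. Spark's `Word2VecModel` builds it by averaging the vectors of the words in the document. The sketch below illustrates that idea with made-up two-dimensional vectors. The helper name `average_vectors` is hypothetical, and averaging only over in-vocabulary tokens is an assumption, since Spark's exact treatment of unknown tokens may differ.

```python
def average_vectors(tokens, word_vectors, size):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return [0.0] * size
    # zip(*vectors) groups the values of each dimension together
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

# Toy embeddings invented for illustration
toy_vectors = {"cat": [1.0, 0.0], "food": [0.0, 1.0]}

print(average_vectors(["cat", "needs", "food"], toy_vectors, size=2))
```

Because every comment collapses to a single average, word order is lost, and short comments such as `[wish, cat]` depend heavily on one or two words.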



### Data Splitting

In [22]:
# Split data into training and testing sets
train, test = df.randomSplit([0.8,0.2],seed=1991)

In [23]:
train_count = train.count()
owner_count = train.filter(train["label"]==1).count()
print("Training Dataset Count: " + str(train_count))
print("Proportion of Dog/Cat Owners in Training Dataset: " + str(round(owner_count/train_count*100,2)) + "%")

Training Dataset Count: 26194
Proportion of Dog/Cat Owners in Training Dataset: 12.24%


## 2. Model Training and Selection

### Logistic Regression

In [24]:
# Build a logistic regression model
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Tune model using ParamGridBuilder
paramGrid = (ParamGridBuilder()
             .addGrid(lr.maxIter, [10, 20, 30])
             .addGrid(lr.regParam, [0.01, 0.1, 0.5, 1])
             .addGrid(lr.elasticNetParam,[0, 0.5, 1]) 
             .build())

# Define evaluator
evaluator = MulticlassClassificationEvaluator(metricName="accuracy",
                                              labelCol="label",
                                              predictionCol="prediction")


# Build cross validation
cv = CrossValidator(estimator=lr, 
                    estimatorParamMaps=paramGrid, 
                    evaluator=evaluator,
                    numFolds=5, seed=1991)

# Fit LR model to training data
cv_model = cv.fit(train)

# Extract best model
best_LR_model = cv_model.bestModel
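
The cost of this search grows with the product of the grid sizes and the number of folds. The short sketch below counts the model fits implied by the grid above. The variable names are illustrative only.

```python
from itertools import product

# Same candidate values as in the ParamGridBuilder above
max_iter = [10, 20, 30]
reg_param = [0.01, 0.1, 0.5, 1]
elastic_net_param = [0, 0.5, 1]
num_folds = 5

grid = list(product(max_iter, reg_param, elastic_net_param))
fits_during_cv = len(grid) * num_folds   # one fit per combination and fold
total_fits = fits_during_cv + 1          # plus the final refit of the best combination

print(len(grid), fits_during_cv, total_fits)
```

Trimming one dimension of the grid is usually the cheapest way to shorten the run, since every removed value saves a whole slice of fits.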

In [25]:
# Extract model hyperparameters
print("Regularization Parameter: {}".format(best_LR_model.getRegParam()))
print("Maximum Iterations: {}".format(best_LR_model.getMaxIter()))
print("Regularization Method: {}".format(best_LR_model.getElasticNetParam()))
# elasticNetParam: 1 for Lasso (L1); 0 for Ridge (L2)

Regularization Parameter: 0.01
Maximum Iterations: 30
Regularization Method: 0.0


In [26]:
# Evaluate predictions on training data
LR_prediction_train = best_LR_model.transform(train)
LR_accuracy_train = evaluator.evaluate(LR_prediction_train)
print("Accuracy for Training Data: {}".format(round(LR_accuracy_train,2)))

Accuracy for Training Data: 0.88


### Random Forest

In [27]:
# Build a random forest model
rf = RandomForestClassifier(featuresCol="features", labelCol="label", seed=1991)

# Tune model using ParamGridBuilder
paramGrid = (ParamGridBuilder()
             .addGrid(rf.numTrees, [20, 50, 100])
             .addGrid(rf.maxDepth, [2, 5, 10])
             .build())

# Define evaluator
evaluator = MulticlassClassificationEvaluator(metricName="accuracy",
                                              labelCol="label",
                                              predictionCol="prediction")

# Build cross validation
cv = CrossValidator(estimator=rf, 
                    estimatorParamMaps=paramGrid, 
                    evaluator=evaluator,
                    numFolds=5, seed=1991)

# Fit RF model to training data
cv_model = cv.fit(train)

# Extract best model
best_RF_model = cv_model.bestModel

In [28]:
# Extract model hyperparameters
print("Number of Trees: {}".format(best_RF_model.getNumTrees))
print("Maximum Depth of The Tree: {}".format(best_RF_model.getMaxDepth())) 

Number of Trees: 20
Maximum Depth of The Tree: 10


In [29]:
# Evaluate predictions on training data
RF_prediction_train = best_RF_model.transform(train)
RF_accuracy_train = evaluator.evaluate(RF_prediction_train)
print("Accuracy for Training Data: {}".format(round(RF_accuracy_train,2)))

Accuracy for Training Data: 0.93


### Gradient-Boosted Trees

**Random Forest** and **GBT** are ensemble learning algorithms: they combine multiple decision trees into a single, more powerful model. <br>

Ensemble methods build on other machine learning methods by combining models, and the combination can be more accurate than any of the individual models. <br>

Both algorithms use **Decision Trees** as base models. The main difference is the order in which the component trees are trained. <br>

**Random Forest** trains each tree independently on a random sample of the data, so multiple trees can be trained in parallel. This approach is known as bootstrap aggregating, or **bagging** (independent models). <br>

**GBT** trains one tree at a time, and each new tree helps correct the errors made by the trees trained before it. Because the trees are built sequentially, training can only be parallelized within a single tree. This approach is known as **boosting** (sequential models). <br>

In the end, both methods produce a weighted collection of **Decision Trees**, and the ensemble makes predictions by combining the results of the individual trees. <br>
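
The difference in training order can be sketched in plain Python with one-split regression trees ("stumps") on a toy 1-D problem. This is a simplified illustration under stated assumptions, not what Spark runs internally: it uses squared-error regression rather than classification, and all function names are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree: (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # splitting at the maximum would leave the right side empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # every x is identical: fall back to a constant prediction
        mean = sum(ys) / len(ys)
        return (xs[0], mean, mean)
    return best[1:]

def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm

def bagging(xs, ys, n_trees, seed=1991):
    """Train each stump independently on a bootstrap sample, then average them."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(predict_stump(s, x) for s in stumps) / n_trees

def boosting(xs, ys, n_trees, step=0.5):
    """Train stumps one after another, each on the residuals left by the previous ones."""
    base = sum(ys) / len(ys)
    residuals = [y - base for y in ys]
    stumps = []
    for _ in range(n_trees):
        s = fit_stump(xs, residuals)
        stumps.append(s)
        residuals = [r - step * predict_stump(s, x) for x, r in zip(xs, residuals)]
    return lambda x: base + step * sum(predict_stump(s, x) for s in stumps)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = list(range(10))
ys = [x * x for x in xs]
bagged = bagging(xs, ys, n_trees=20)
boosted_1 = boosting(xs, ys, n_trees=1)
boosted_20 = boosting(xs, ys, n_trees=20)
print(mse(boosted_1, xs, ys), mse(boosted_20, xs, ys), mse(bagged, xs, ys))
```

The loop in `bagging` has no dependency between iterations, which is why the trees could be trained in parallel, while each iteration of `boosting` needs the residuals produced by the previous one.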


In [30]:
from IPython.display import Image

In [31]:
Image(url= "https://miro.medium.com/max/1400/1*-qr7ugMY0tvQOE1aoD3lVg.png"
,width=500, height=300)

In [32]:
Image(url= "https://miro.medium.com/max/4800/1*9MlWByZxGnVDVKvJpl3IXg.png"
,width=500, height=300)

In [33]:
# Build a gradient-boosted tree model
gbt = GBTClassifier(featuresCol="features", labelCol="label", seed=1991)

# Tune model using ParamGridBuilder
paramGrid = (ParamGridBuilder()
             .addGrid(gbt.maxIter, [10, 20, 30])
             .addGrid(gbt.maxDepth, [5, 10, 20])
             .build())

# Define evaluator
evaluator = MulticlassClassificationEvaluator(metricName="accuracy",
                                              labelCol="label",
                                              predictionCol="prediction")

# Build cross validation
cv = CrossValidator(estimator=gbt, 
                    estimatorParamMaps=paramGrid, 
                    evaluator=evaluator,
                    numFolds=5, seed=1991)

# Fit GBT model to training data
cv_model = cv.fit(train)

# Extract best model
best_GBT_model = cv_model.bestModel

In [34]:
# Extract model hyperparameters
print("Maximum Iterations: {}".format(best_GBT_model.getMaxIter()))
print("Maximum Depth of The Tree: {}".format(best_GBT_model.getMaxDepth())) 

Maximum Iterations: 30
Maximum Depth of The Tree: 5


In [35]:
# Evaluate predictions on training data
GBT_prediction_train = best_GBT_model.transform(train)
GBT_accuracy_train = evaluator.evaluate(GBT_prediction_train)
print("Accuracy for Training Data: {}".format(round(GBT_accuracy_train,2)))

Accuracy for Training Data: 0.92


### Model Performance on Testing Set

**Accuracy**, **precision**, and **recall** on the testing set are all higher for RF and GBT than for LR. RF and GBT have similar accuracy, but RF has higher **precision** and GBT has higher **recall**. Because the goal is to identify as many dog/cat owners as possible, **recall** matters more in this situation, so **GBT** is the most appropriate of the three models for this data.

In [36]:
# Predict on test data
LR_prediction_test = best_LR_model.transform(test)

In [37]:
LR_prediction_test.select("label", "rawPrediction", "probability", "prediction").show(5)

+-----+--------------------+--------------------+----------+
|label|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+----------+
|    1|[1.56259321466050...|[0.82672514754066...|       0.0|
|    1|[-0.0447241139494...|[0.48882083487423...|       1.0|
|    1|[1.65312199092632...|[0.83931255011807...|       0.0|
|    1|[-0.7763730808973...|[0.31510209608705...|       1.0|
|    1|[0.87093261564528...|[0.70493971885251...|       0.0|
+-----+--------------------+--------------------+----------+
only showing top 5 rows



In [38]:
# Predict on test data
RF_prediction_test = best_RF_model.transform(test)

In [39]:
RF_prediction_test.select("label", "rawPrediction", "probability", "prediction").show(5)

+-----+--------------------+--------------------+----------+
|label|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+----------+
|    1|[18.9322736322717...|[0.94661368161358...|       0.0|
|    1|[6.87592279255087...|[0.34379613962754...|       1.0|
|    1|[15.8015584919507...|[0.79007792459753...|       0.0|
|    1|[6.18069803726720...|[0.30903490186336...|       1.0|
|    1|[11.4144996489556...|[0.57072498244778...|       0.0|
+-----+--------------------+--------------------+----------+
only showing top 5 rows



In [40]:
# Predict on test data
GBT_prediction_test = best_GBT_model.transform(test)

In [41]:
GBT_prediction_test.select("label", "rawPrediction", "probability", "prediction").show(5)

+-----+--------------------+--------------------+----------+
|label|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+----------+
|    1|[1.30976003468303...|[0.93210734115254...|       0.0|
|    1|[-0.2455651688344...|[0.37962732516163...|       1.0|
|    1|[1.07388398205819...|[0.89546001728828...|       0.0|
|    1|[-0.5926512762559...|[0.23410012298118...|       1.0|
|    1|[0.28808282795089...|[0.64018464740453...|       0.0|
+-----+--------------------+--------------------+----------+
only showing top 5 rows



In [42]:
def model_evaluation(predictions):
    """Print confusion-matrix counts, accuracy, precision, and recall for a predictions DataFrame."""
    # Count each cell of the confusion matrix (label 1 = dog/cat owner)
    TP = predictions.filter((predictions["label"] == 1) & (predictions["prediction"] == 1.0)).count()
    FP = predictions.filter((predictions["label"] == 0) & (predictions["prediction"] == 1.0)).count()
    TN = predictions.filter((predictions["label"] == 0) & (predictions["prediction"] == 0.0)).count()
    FN = predictions.filter((predictions["label"] == 1) & (predictions["prediction"] == 0.0)).count()

    # Guard against empty denominators (e.g. a model that never predicts 1)
    total = TP + FP + TN + FN
    accuracy = (TP + TN) / total if total else 0.0
    precision = TP / (TP + FP) if (TP + FP) else 0.0
    recall = TP / (TP + FN) if (TP + FN) else 0.0

    print("True Positives:", TP)
    print("False Positives:", FP)
    print("True Negatives:", TN)
    print("False Negatives:", FN)
    print("Accuracy:", accuracy)
    print("Precision:", precision)
    print("Recall:", recall)

In [43]:
print("Result Evaluation Metrics for LR Model on Testing Data:")
model_evaluation(LR_prediction_test)

Result Evaluation Metrics for LR Model on Testing Data:
True Positives: 153
False Positives: 94
True Negatives: 5533
False Negatives: 636
Accuracy: 0.8862219451371571
Precision: 0.6194331983805668
Recall: 0.19391634980988592


In [44]:
print("Result Evaluation Metrics for RF Model on Testing Data:")
model_evaluation(RF_prediction_test)

Result Evaluation Metrics for RF Model on Testing Data:
True Positives: 266
False Positives: 106
True Negatives: 5521
False Negatives: 523
Accuracy: 0.9019638403990025
Precision: 0.7150537634408602
Recall: 0.33713561470215464


In [45]:
print("Result Evaluation Metrics for GBT Model on Testing Data:")
model_evaluation(GBT_prediction_test)

Result Evaluation Metrics for GBT Model on Testing Data:
True Positives: 361
False Positives: 178
True Negatives: 5449
False Negatives: 428
Accuracy: 0.9055486284289277
Precision: 0.6697588126159555
Recall: 0.45754119138149557
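
Since only about 12% of the rows are owners, accuracy alone says little. The sketch below recomputes the metrics from the confusion-matrix counts printed above and adds two figures that are not in the original notebook: the F1 score and the accuracy of a baseline that always predicts "not an owner". The function name `summarize` is hypothetical.

```python
def summarize(tp, fp, tn, fn):
    """Derive standard metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Always predicting 0 is correct on every actual negative (TN + FP)
        "baseline_accuracy": (tn + fp) / total,
    }

# (TP, FP, TN, FN) copied from the three outputs above
counts = {
    "LR": (153, 94, 5533, 636),
    "RF": (266, 106, 5521, 523),
    "GBT": (361, 178, 5449, 428),
}
results = {name: summarize(*c) for name, c in counts.items()}

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

The always-0 baseline already reaches about 0.877 accuracy on this testing set, so the models differ mainly in recall and F1, which supports using recall rather than accuracy to choose between them.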


## 3. Estimated Classification of Dog/Cat Owners

$
\text{Estimated fraction of dog/cat owners}=\frac{\#\text{ of labeled owners }+ \space \#\text{ of predicted owners without label}}{\#\text{ of total users}}
$

In [46]:
# Predict on the full sampled dataset (training and testing sets combined)
predictions = best_GBT_model.transform(df)

In [47]:
labeled = df.filter(df["label"]==1).count()
predicted = predictions.filter((col("label")==0)&(col("prediction")==1.0)).count()
fraction = (labeled+predicted)/data_count

In [48]:
print("The labeled fraction of the cat/dog owners is " + str(round(labeled/data_count,3)))

The labeled fraction of the cat/dog owners is 0.123


In [49]:
print("The estimated fraction of the cat/dog owners is " + str(round(fraction,3)))

The estimated fraction of the cat/dog owners is 0.145
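
The formula above is stated in terms of users, while the cells above count rows, that is, comments. A user who comments many times is therefore counted many times. The sketch below contrasts the two ways of counting on a handful of invented rows. The data and function names are illustrative assumptions, not results from this notebook.

```python
# Toy rows of (userid, label, prediction); the values are invented for illustration
rows = [
    (87.0, 1, 1.0),
    (87.0, 0, 0.0),
    (530.0, 0, 1.0),
    (670.0, 0, 0.0),
    (1031.0, 0, 0.0),
]

def owner_fraction_by_comment(rows):
    """Share of comments that are labeled or predicted as written by an owner."""
    flagged = sum(1 for _, label, pred in rows if label == 1 or pred == 1.0)
    return flagged / len(rows)

def owner_fraction_by_user(rows):
    """Share of distinct users with at least one labeled or predicted owner comment."""
    all_users = {userid for userid, _, _ in rows}
    owners = {userid for userid, label, pred in rows if label == 1 or pred == 1.0}
    return len(owners) / len(all_users)

print(owner_fraction_by_comment(rows), owner_fraction_by_user(rows))
```

Note also that the sampled dataset keeps every labeled owner but only a small share of the other comments, so a fraction computed on it is not directly comparable to the share of owners among all commenters.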


## 4. Analysis on Topics of Comments

Identification of Important Topics to Dog/Cat Owners

In [50]:
# Select labeled and predicted owners 
df_owner = predictions.filter((col("label")==1)|(col("prediction")==1.0)).select("userid","words_clean")

In [51]:
# Stemming words: break words down to their roots
stemmer = SnowballStemmer("english")
stemmer_udf = udf(lambda tokens: [stemmer.stem(t) for t in tokens], ArrayType(StringType()))
df_stemmed = df_owner.withColumn("words_stemmed", stemmer_udf("words_clean"))

In [52]:
df_owner = df_stemmed.select('userid', 'words_stemmed')

In [53]:
# Fit a CountVectorizer model
cv = CountVectorizer(inputCol="words_stemmed", outputCol="features", 
                     minTF=1, # minimum number of times a term must appear in a document to be counted
                     minDF=5) # minimum number of documents a term must appear in to enter the vocabulary
cv_model = cv.fit(df_owner)
count_vectors = cv_model.transform(df_owner).select("userid", "features")
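
The effect of the document-frequency threshold can be sketched in a few lines of plain Python. The helper names are hypothetical, and the vocabulary here is sorted alphabetically for readability, whereas Spark orders it by term frequency.

```python
from collections import Counter

def build_vocabulary(docs, min_df):
    """Keep terms that appear in at least min_df different documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(term for term, n in doc_freq.items() if n >= min_df)

def count_vector(tokens, vocab):
    """Bag-of-words counts for one document, aligned with the vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

# Toy stemmed documents invented for illustration
docs = [["dog", "love", "dog"], ["dog", "vet"], ["cat", "love"]]
vocab = build_vocabulary(docs, min_df=2)
print(vocab, count_vector(docs[0], vocab))
```

Rare terms such as misspellings and user names fall below the threshold and never reach the topic model, which keeps the vocabulary small.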

In [54]:
# Train an LDA model
num_topics = 10
top_n_words = 10

lda = LDA(k=num_topics, maxIter=20)
LDA_model = lda.fit(count_vectors)

In [55]:
# Obtain indices of top n words
topics = LDA_model.describeTopics(maxTermsPerTopic=top_n_words)



In [56]:
# Obtain a list of words
vocabArray = cv_model.vocabulary
len(vocabArray)

1667

In [57]:
# Map each term index back to its word in the vocabulary
index_to_words = udf(lambda word_index: [vocabArray[i] for i in word_index], ArrayType(StringType()))
key_words = topics.select(index_to_words(topics.termIndices).alias('words'))

In [58]:
# Show the top words for each topic (one row per topic)
key_words.show(num_topics, False)

+--------------------------------------------------------------------+
|words                                                               |
+--------------------------------------------------------------------+
|[gotcha, cat, love, game, name, everyth, special, seem, bunni, talk]|
|[daisi, bunni, grate, cream, cone, ice, like, hour, dog, took]      |
|[kitten, enter, zone, meow, cat, dog, butt, fur, know, theyr]       |
|[dog, like, game, hate, high, fuck, th, danc, feel, put]            |
|[dog, cat, like, love, get, one, look, im, got, know]               |
|[dog, like, outsid, clean, look, vet, get, puppi, hous, toy]        |
|[dog, cat, gibson, marm, swim, like, snack, fed, want, support]     |
|[dog, light, bed, hernia, give, control, grow, sale, want, dont]    |
|[cat, minut, time, dog, one, year, com, belli, tree, get]           |
|[dog, pls, love, vaccin, puppi, hurt, need, want, sooo, one]        |
+--------------------------------------------------------------------+



## 5. Identification of Popular Creators for Dog/Cat Owners

Who has the largest number of comments made by cat/dog owners?

In [59]:
predictions.createOrReplaceTempView("predictions")

In [60]:
creators = spark.sql("SELECT creator_name, count(creator_name) AS number \
                      FROM (SELECT creator_name FROM predictions WHERE label=1 OR prediction=1.0) \
                      GROUP BY creator_name \
                      ORDER BY number DESC")
creators.show()

+--------------------+------+
|        creator_name|number|
+--------------------+------+
|            The Dodo|   477|
|    Cole & Marmalade|   312|
|     Gohan The Husky|   248|
|        Robin Seplut|   239|
|Hope For Paws - O...|   235|
|Zak Georges Dog T...|   231|
|           Vet Ranch|   209|
|  Taylor Nicole Dean|   192|
|Gone to the Snow ...|   184|
|          stacyvlogs|   170|
|    Brave Wilderness|   166|
|       Brian Barczyk|   161|
|   Talking Kitty Cat|   133|
|           meow meow|    91|
|     Viktor Larkhill|    81|
|        Paws Channel|    74|
|          Funny Pets|    55|
|    SlideShow ForFun|    55|
|       Cat Man Chris|    49|
|  The Pet Collective|    47|
+--------------------+------+
only showing top 20 rows
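
The query above is a filter followed by a grouped count. For readers less familiar with SQL, the same logic can be sketched in plain Python on a few invented rows. The data and the helper name `top_creators` are illustrative assumptions.

```python
from collections import Counter

# Toy rows of (creator_name, label, prediction); the values are invented for illustration
rows = [
    ("The Dodo", 1, 0.0),
    ("The Dodo", 0, 1.0),
    ("Vet Ranch", 1, 1.0),
    ("The Dodo", 0, 0.0),
    ("Vet Ranch", 0, 0.0),
]

def top_creators(rows, n=20):
    """Count labeled-or-predicted owner comments per creator, most frequent first."""
    counts = Counter(
        creator for creator, label, pred in rows if label == 1 or pred == 1.0
    )
    return counts.most_common(n)

print(top_creators(rows))
```

As in the query, the ranking counts comments rather than distinct users, so a few very active commenters can lift a creator's position.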

