## Twitter Gender Classification
### Final Project

#### University of California, Santa Barbara
#### PSTAT 135: Big Data Analytics

Source: https://www.kaggle.com/crowdflower/twitter-user-gender-classification

## Dataset Preprocessing

The dataset contains about 20,000 rows, each with a user name, a random tweet, account profile and image, location, and link and sidebar color. All tweets were posted on October 26, 2015.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession \
        .builder \
        .appName("comm") \
        .getOrCreate()

tweets = spark.read.csv('gender_data.csv', header = True)

In [2]:
type(tweets)

pyspark.sql.dataframe.DataFrame

In [3]:
tweets.count()

24230

In [4]:
tweets.columns

['_unit_id',
 '_golden',
 '_unit_state',
 '_trusted_judgments',
 '_last_judgment_at',
 'gender',
 'gender:confidence',
 'profile_yn',
 'profile_yn:confidence',
 'created',
 'description',
 'fav_number',
 'gender_gold',
 'link_color',
 'name',
 'profile_yn_gold',
 'profileimage',
 'retweet_count',
 'sidebar_color',
 'text',
 'tweet_coord',
 'tweet_count',
 'tweet_created',
 'tweet_id',
 'tweet_location',
 'user_timezone']

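**We removed the columns `_unit_id`, `_golden`, `_unit_state`, `_trusted_judgments`, `_last_judgment_at`, `gender:confidence`, `profile_yn`, `profile_yn:confidence`, `gender_gold`, `profile_yn_gold`, `link_color`, `profileimage`, `sidebar_color`, `description`, `name`, `tweet_coord`, `tweet_location`, `user_timezone`, and `tweet_id` because they were not relevant to our model.**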

In [5]:
columns_to_drop = ['_unit_id','_golden','_unit_state','_trusted_judgments','_last_judgment_at',
                   'gender:confidence','profile_yn','profile_yn:confidence','gender_gold',
                   'profile_yn_gold','link_color','profileimage','sidebar_color','description','name',
                   'tweet_coord','tweet_location','user_timezone','tweet_id']
tweets = tweets.drop(*columns_to_drop)
tweets.columns

['gender',
 'created',
 'fav_number',
 'retweet_count',
 'text',
 'tweet_count',
 'tweet_created']

**The `gender` column contains labels other than `male`, `female`, and `brand`, so we keep only the rows that carry one of these three labels.**

In [6]:
tweets = tweets.filter(tweets.gender.isin('male', 'female', 'brand'))

In [7]:
tweets.count()

18836

**After filtering on `gender`, 18,836 rows remain. Next, we check that every row has a non-null value in the `text` column.**

In [8]:
# Number of non-null values in "text"
tweets.filter(tweets.text.isNotNull()).count()

17748

Only 17,748 of the 18,836 rows have a non-null `text`, so 1,088 rows are missing it. We remove those rows, since the tweet text is essential to our model.

In [9]:
tweets = tweets.filter(tweets.text.isNotNull())

**For columns `fav_number`, `retweet_count`, and `tweet_count`, we will check to see if there are null values.** 

- If more than 10% of the values are null, we will not include that column later on in our model. 

- If less than 10% of the values are null, we will replace the null values with the median of that column. We chose the median because the mean could be a fractional value such as 77.5, and a tweet cannot be retweeted 77.5 times.
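The rule above can be sketched in plain Python. This is only an illustration on a small made-up list, not the Spark code used in this project; the helper name `handle_nulls` and the sample values are assumptions. The sketch uses `statistics.median_low` so that the filled-in value is always one that actually occurs in the column.

```python
from statistics import median_low

def handle_nulls(values, threshold=0.10):
    """Apply the 10% rule to one column given as a list (None marks a null).

    Returns None if the column should be dropped, otherwise a copy of the
    column with nulls replaced by the (low) median of the observed values.
    """
    observed = [v for v in values if v is not None]
    null_fraction = 1 - len(observed) / len(values)
    if null_fraction > threshold:
        return None  # too many nulls: leave this column out of the model
    fill = median_low(observed)  # always an observed value, never e.g. 77.5
    return [fill if v is None else v for v in values]

# Hypothetical column with 1 null out of 11 values (about 9%)
column = [3, 10, None, 7, 1, 250, 42, 8, 5, 12, 9]
print(handle_nulls(column))
```

In the notebook itself, the replacement is done later by the `Imputer` stage of the pipeline rather than by hand.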

In [10]:
# Number of null values in "fav_number"
tweets.filter(tweets.fav_number.isNull()).count()

0

`fav_number` : Since there are no null values, there is no need to do any replacing.

In [11]:
# Number of null values in "retweet_count"
tweets.filter(tweets.retweet_count.isNull()).count()

0

`retweet_count` : Since there are no null values, there is no need to do any replacing. 

In [12]:
# Number of null values in "tweet_count"
tweets.filter(tweets.tweet_count.isNull()).count()

1259

`tweet_count` : There are 1,259 null values. Since they make up less than 10% of this column (1,259 of 17,748 rows, about 7%), we will replace them with the column median during pipeline building.

### Creating "punc", "emojis", "account_years"

Getting punctuation counts for each tweet

In [13]:
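from string import punctuation
from collections import Counter

def punc_func(tweet):
    # `tweet` is a Counter mapping each character to its number of occurrences
    total = 0  # named `total` to avoid shadowing the built-in sum()
    for key in tweet:
        if key in punctuation:
            total += tweet[key]
    return total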

In [14]:
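# Note: str(tweet) renders the whole Row, e.g. "Row(text='...')", so the
# wrapper's own punctuation characters are included in every count
punc = tweets.select("text").rdd.map(lambda tweet: Counter(str(tweet))) \
            .map(lambda x: punc_func(x)).map(lambda x: (x, )).toDF(['punc'])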
punc.show(5)

+----+
|punc|
+----+
|  11|
|  18|
|   5|
|  19|
|  15|
+----+
only showing top 5 rows
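For reference, the same kind of count can be computed directly on a string. The sketch below is plain Python (no Spark), and the function name and sample tweet are made up for illustration. It also shows why counting over the string form of a `Row` inflates the result: the `Row(text='...')` wrapper contributes punctuation characters of its own.

```python
from string import punctuation

def count_punctuation(text):
    """Count the characters of `text` that appear in string.punctuation."""
    return sum(1 for ch in text if ch in punctuation)

# Hypothetical tweet text
tweet = "wow!!! check this out: #spark"

# Punctuation in the tweet itself
print(count_punctuation(tweet))

# Punctuation when the Row wrapper is counted as well
print(count_punctuation("Row(text='" + tweet + "')"))
```

If the wrapper characters are not wanted in the feature, the text field can be read from the row directly (for example `tweet['text']`) before counting.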



Flagging whether each tweet contains an emoji, which appears in this dataset as the character '�' (true = present, false = absent).

In [15]:
emojis = tweets.select("text").rdd.map(lambda tweet: str(tweet)) \
            .map(lambda x: '�' in x).map(lambda x: (x, )).toDF(['emojis'])
emojis.show(5)

+------+
|emojis|
+------+
| false|
|  true|
| false|
| false|
|  true|
+------+
only showing top 5 rows



Getting the number of years each Twitter user has had their account.

In [16]:
# extracting only the "year" part of 'created' & subtracting it from 15 (representing 2015)
years = tweets.rdd.map(lambda row: row['created'].split(' ')) \
                .map(lambda x: x[0]) \
                .map(lambda x: x.split('/')) \
                .map(lambda x: x[2]) \
                .map(lambda x: (15-int(x))) \
                .map(lambda x: (x, )).toDF(['account_years'])
years.show(5)

+-------------+
|account_years|
+-------------+
|            2|
|            3|
|            1|
|            6|
|            1|
+-------------+
only showing top 5 rows
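The same year arithmetic can be written with `datetime` parsing. This is a plain-Python sketch, not the notebook's method: it assumes `created` strings look like `12/5/13 1:48` (month/day/two-digit year, then hour:minute), and the sample values and the function name `account_years` are made up for illustration.

```python
from datetime import datetime

def account_years(created, reference_year=2015):
    """Years between the account creation year and the reference year.

    Assumes the format month/day/two-digit-year hour:minute.
    """
    created_at = datetime.strptime(created, "%m/%d/%y %H:%M")
    return reference_year - created_at.year

# Hypothetical creation timestamps
print(account_years("12/5/13 1:48"))
print(account_years("6/11/09 22:39"))
```

Parsing the full timestamp fails loudly on malformed values instead of silently picking the wrong field, which can make bad rows easier to spot.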



### Convert Columns to type Integer

In [17]:
tweets = tweets.withColumn("tweet_count",tweets["tweet_count"].cast('integer'))
tweets = tweets.withColumn("fav_number",tweets["fav_number"].cast('integer'))
tweets = tweets.withColumn("retweet_count",tweets["retweet_count"].cast('integer'))

### Lowercase tweets

In [18]:
from pyspark.sql.functions import lower, col
tweets = tweets.withColumn("text",lower(col('text')))
tweets.select("text").show(1, truncate=False)

+-------------------------------------------------------------------------------------------------------------+
|text                                                                                                         |
+-------------------------------------------------------------------------------------------------------------+
|robbie e responds to critics after win against eddie edwards in the #worldtitleseries https://t.co/nsybbmvjkz|
+-------------------------------------------------------------------------------------------------------------+
only showing top 1 row



## Configure pipeline pt.1

In [19]:
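from pyspark.ml import Pipeline
from pyspark.ml.feature import (Imputer, StringIndexer, Tokenizer, StopWordsRemover,
                                HashingTF, NGram, VectorAssembler, StandardScaler)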

In [20]:
# impute median to null values in "tweet_count"
imputer = Imputer(inputCols=["tweet_count"], outputCols=["tweet_count_new"]).setStrategy("median")

# convert "gender" to numeric type (0 = female, 1 = male, 2 = brand)
indexer = StringIndexer(inputCol="gender", outputCol="gender_num")

# process "text"
tokenizer = Tokenizer(inputCol="text", outputCol="words") #tokenize
remover = StopWordsRemover(inputCol="words", outputCol="filtered") #remove stop words

# build the pipeline
pipeline = Pipeline(stages=[imputer, indexer, tokenizer, remover])

# fit & transform the pipeline
tweets_new = pipeline.fit(tweets).transform(tweets)
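As a rough picture of what the tokenizing and stop-word stages do to a single tweet, here is a plain-Python sketch. It only approximates the Spark stages: the stop-word set is a tiny hand-picked subset chosen for illustration (not Spark's default list), and the function names are hypothetical.

```python
# Tiny illustrative stop-word set (an assumption; Spark's default list is much longer)
STOP_WORDS = {"a", "an", "the", "to", "in", "of", "and", "after"}

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

words = tokenize("Robbie E responds to critics after win against Eddie Edwards")
print(remove_stop_words(words))
```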

### Stemming words

In [21]:
from pyspark.sql.functions import udf
from nltk.stem.snowball import SnowballStemmer
from pyspark.sql.types import ArrayType, StringType

stemmer = SnowballStemmer(language='english')
stemmer_udf = udf(lambda tokens: [stemmer.stem(token) for token in tokens], ArrayType(StringType()))
stemmed = tweets_new.withColumn("stemmed", stemmer_udf("filtered")).select("stemmed")

In [22]:
stemmed.show(1)

+--------------------+
|             stemmed|
+--------------------+
|[robbi, e, respon...|
+--------------------+
only showing top 1 row



### Adding "stemmed", "punc", "emojis", "account_years" to "tweets_new"

In [23]:
from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window

In [24]:
tweets_new = tweets_new.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))
punc_new = punc.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))
emojis_new = emojis.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))
years_new = years.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))
stemmed_new = stemmed.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))

In [25]:
tweets_final = tweets_new.join(punc_new, on=["row_index"]).join(emojis_new, on=["row_index"]) \
            .join(years_new, on=["row_index"]).join(stemmed_new, on=["row_index"])

In [26]:
tweets_final.columns

['row_index',
 'gender',
 'created',
 'fav_number',
 'retweet_count',
 'text',
 'tweet_count',
 'tweet_created',
 'tweet_count_new',
 'gender_num',
 'words',
 'filtered',
 'punc',
 'emojis',
 'account_years',
 'stemmed']

## Configure pipeline pt.2 & vector assembler

In [27]:
# process "stemmed"
htf = HashingTF(inputCol="stemmed", outputCol="tf", numFeatures=10000) #term frequencies for words

# create & process "bigrams"
bigram = NGram(n=2, inputCol="stemmed", outputCol="bigrams") #bigrams
bhtf = HashingTF(inputCol="bigrams", outputCol="btf", numFeatures=10000) #term frequencies for bigrams

# package together all features
va = VectorAssembler(inputCols=["tf", "btf", "punc", "emojis", "tweet_count_new", "fav_number", "retweet_count",
                               "account_years"], outputCol="features")

# scaling the features
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")

# build the pipeline
pipeline = Pipeline(stages=[htf, bigram, bhtf, va, scaler])

# fit & transform the pipeline
df = pipeline.fit(tweets_final).transform(tweets_final)
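The bigram and term-frequency stages can be pictured with a small plain-Python sketch. It is illustrative only: `zlib.crc32` stands in for the hash function, so the bucket indices will not match what `HashingTF` produces, and the token list and function names are made up.

```python
import zlib
from collections import Counter

def bigrams(tokens):
    """Join each pair of consecutive tokens with a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def hashing_tf(terms, num_features=10000):
    """Map each term to a bucket index and count occurrences per bucket.

    zlib.crc32 is a stand-in hash (an assumption of this sketch), so the
    indices differ from Spark's HashingTF output.
    """
    return Counter(zlib.crc32(term.encode("utf-8")) % num_features for term in terms)

# Hypothetical stemmed tokens
tokens = ["robbi", "e", "respond", "critic", "win"]
print(bigrams(tokens))
print(hashing_tf(tokens))
print(hashing_tf(bigrams(tokens)))
```

Because different terms can land in the same bucket, a fixed `num_features` trades some collisions for a feature vector of known size.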

In [28]:
df.columns

['row_index',
 'gender',
 'created',
 'fav_number',
 'retweet_count',
 'text',
 'tweet_count',
 'tweet_created',
 'tweet_count_new',
 'gender_num',
 'words',
 'filtered',
 'punc',
 'emojis',
 'account_years',
 'stemmed',
 'tf',
 'bigrams',
 'btf',
 'features',
 'scaledFeatures']

In [29]:
final = df.select("scaledFeatures", "gender_num") \
        .withColumnRenamed('gender_num', 'label').withColumnRenamed('scaledFeatures', 'features')
final.cache()

DataFrame[features: vector, label: double]

In [30]:
final.columns

['features', 'label']

### Splitting the Dataset

In [31]:
(training, test) = final.randomSplit([0.7,0.3])

In [None]:
#sc = spark.sparkContext
#sc.setCheckpointDir('checkpoint')
#training.checkpoint()

### Model 1: Multinomial Logistic Regression

In [32]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(training)
prediction_lr = lrModel.transform(test)

In [33]:
trainingSummary = lrModel.summary

In [34]:
# accuracy on the training data (the test predictions in prediction_lr are not evaluated here)
trainingSummary.accuracy

0.3615910004017678
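To put an accuracy figure in context, it can help to compare it with the accuracy of always predicting the most frequent class. The sketch below does this in plain Python on made-up labels and predictions; the function names are hypothetical and nothing here is computed from the project's data.

```python
from collections import Counter

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Made-up true labels for three classes (0, 1, 2) and made-up predictions
labels      = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
predictions = [0, 0, 1, 0, 1, 0, 1, 2, 0, 2]
print(accuracy(predictions, labels), majority_baseline(labels))
```

A model whose accuracy is close to this baseline has learned little beyond the class frequencies.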

### Model 2: Random Forest

In [None]:
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=10)
model = rf.fit(training)

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Make predictions
predictions = model.transform(test)
predictions.select("prediction", "label", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

### Model 3: KMeans

In [35]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

kmeans = KMeans().setK(3).setSeed(1)
model = kmeans.fit(training)

In [36]:
# Make predictions
predictions = model.transform(test)

# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()

silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

Silhouette with squared euclidean distance = 0.10847332711235627
Cluster Centers: 
[0.02906618 0.02394712 0.03000021 ... 0.35524774 0.03348663 1.31634585]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 1.35205825e-03
 0.00000000e+00 1.39111408e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 1.59065676e-03
 0.00000000e+00 2.31852347e+00]
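For intuition about the silhouette score, here is a naive plain-Python version of the textbook definition using squared Euclidean distance. It is a quadratic-time sketch on made-up 2-D points, not how Spark computes the value, and the function names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_silhouette(points, clusters):
    """Mean silhouette over all points, using squared Euclidean distance."""
    scores = []
    for i, (p, c) in enumerate(zip(points, clusters)):
        # Distances from p to every other point, grouped by cluster id
        by_cluster = {}
        for j, (q, d) in enumerate(zip(points, clusters)):
            if i != j:
                by_cluster.setdefault(d, []).append(squared_distance(p, q))
        if c not in by_cluster or len(by_cluster) < 2:
            scores.append(0.0)  # singleton or only one cluster: score 0 by convention
            continue
        a = sum(by_cluster[c]) / len(by_cluster[c])  # mean distance within own cluster
        b = min(sum(v) / len(v) for k, v in by_cluster.items() if k != c)  # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Made-up points forming two tight, well-separated groups
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(mean_silhouette(points, [0, 0, 1, 1]))  # sensible assignment
print(mean_silhouette(points, [0, 1, 0, 1]))  # assignment that mixes the groups
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values indicate points that sit closer to another cluster than to their own.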


### Model 4: NaiveBayes

In [37]:
from pyspark.ml.classification import NaiveBayes

nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
model = nb.fit(training)
predictions = model.transform(test)

In [39]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = " + str(accuracy))

Test set accuracy = 0.444653969451254
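A single accuracy number hides which classes are being confused. As a sketch of how a per-class breakdown could be computed, the plain-Python example below counts (true label, predicted label) pairs and derives per-class recall. The labels and predictions are made up and the function names are hypothetical.

```python
from collections import Counter

def confusion_counts(predictions, labels):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(labels, predictions))

def recall_per_class(predictions, labels):
    """For each true class, the fraction of its rows predicted correctly."""
    counts = confusion_counts(predictions, labels)
    totals = Counter(labels)
    return {c: counts[(c, c)] / totals[c] for c in sorted(totals)}

# Made-up true labels for three classes (0, 1, 2) and made-up predictions
labels      = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
predictions = [0, 0, 1, 0, 1, 0, 1, 2, 0, 2]
print(confusion_counts(predictions, labels))
print(recall_per_class(predictions, labels))
```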
