# Twitter Gender Classification
### Final Project

#### University of California, Santa Barbara
#### PSTAT 135: Big Data Analytics

Source: https://www.kaggle.com/crowdflower/twitter-user-gender-classification

# Dataset Preprocessing and Exploratory Data Analysis

The dataset contains about 20,000 rows. Each row includes a user name, a randomly selected tweet, the profile image and account statistics, location, and link and sidebar colors. All tweets were posted in 2015.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession \
        .builder \
        .appName("comm") \
        .getOrCreate()

tweets = spark.read.csv('gender_data.csv', header = True)

In [2]:
type(tweets)

pyspark.sql.dataframe.DataFrame

In [3]:
tweets.count()

24230

In [4]:
tweets.columns

['_unit_id',
 '_golden',
 '_unit_state',
 '_trusted_judgments',
 '_last_judgment_at',
 'gender',
 'gender:confidence',
 'profile_yn',
 'profile_yn:confidence',
 'created',
 'description',
 'fav_number',
 'gender_gold',
 'link_color',
 'name',
 'profile_yn_gold',
 'profileimage',
 'retweet_count',
 'sidebar_color',
 'text',
 'tweet_coord',
 'tweet_count',
 'tweet_created',
 'tweet_id',
 'tweet_location',
 'user_timezone']

We remove the columns `_unit_id`, `_golden`, `_unit_state`, `_last_judgment_at`, `_trusted_judgments`, `gender:confidence`, `profile_yn`, `profile_yn:confidence`, `description`, `gender_gold`, `link_color`, `name`, `profile_yn_gold`, `profileimage`, `sidebar_color`, `tweet_coord`, `tweet_id`, `tweet_location`, and `user_timezone` because they are not relevant to our model.

In [5]:
tweets = tweets.select(['gender','created','fav_number','retweet_count','text','tweet_count','tweet_created'])
tweets.show(5)

+------+--------------+----------+-------------+--------------------+-----------+--------------+
|gender|       created|fav_number|retweet_count|                text|tweet_count| tweet_created|
+------+--------------+----------+-------------+--------------------+-----------+--------------+
|  male|  12/5/13 1:48|         0|            0|Robbie E Responds...|     110964|10/26/15 12:40|
|  male| 10/1/12 13:51|        68|            0|���It felt like t...|       7471|10/26/15 12:40|
|  male|11/28/14 11:30|      7696|            1|i absolutely ador...|       5617|10/26/15 12:40|
|  male| 6/11/09 22:39|       202|            0|Hi @JordanSpieth ...|       1693|10/26/15 12:40|
|female| 4/16/14 13:23|     37318|            0|Watching Neighbou...|      31462|10/26/15 12:40|
+------+--------------+----------+-------------+--------------------+-----------+--------------+
only showing top 5 rows



In [6]:
tweets.groupBy('gender').count().sort('count', ascending = False).show()

+--------------------+-----+
|              gender|count|
+--------------------+-----+
|              female| 6700|
|                male| 6194|
|               brand| 5942|
|                null| 3188|
|             unknown| 1117|
|          6.5874E+17|   83|
|          6.5873E+17|   62|
|              Medina|   14|
|              London|   13|
|      10/26/15 12:40|   11|
|Porto, Portugal �...|   10|
|1/16 cute pickle ...|   10|
|              0084B4|    9|
|Republic of the P...|    8|
|                  UK|    7|
|                 USA|    7|
|         3/4 + band |    6|
|              FF005E|    6|
|PRE-ORDER MADE IN...|    5|
|           New York |    4|
+--------------------+-----+
only showing top 20 rows



As we can see above, the `gender` column contains many unexpected values (dates, colors, locations), which suggests that some rows were misaligned when the CSV was parsed. We keep only the tweets labeled `male`, `female`, or `brand`, the three most common labels.

In [7]:
tweets = tweets.filter(tweets.gender.isin('male', 'female', 'brand'))
tweets.count()

18836

### Checking for Null Values

We will remove any rows with a null value in `text`, since we are only interested in predicting gender for users with tweets.

In [8]:
# Number of null values in "text"
tweets.filter(tweets.text.isNull()).count()

1088

In [9]:
tweets = tweets.filter(tweets.text.isNotNull())
tweets.count()

17748

For columns `fav_number`, `retweet_count`, and `tweet_count`:

- If more than 10% of the values are null, we will not include that column later on in our model. 

- If less than 10% of the values are null, we will replace the null values with the median of that column. We chose the median because the mean can be fractional (for example 77.5), and a tweet cannot be retweeted 77.5 times.

- If there are no null values, we will keep the column unchanged.
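
The rule above can be sketched in plain Python. The function name `null_strategy`, its `threshold` argument, and the use of `None` for missing entries are assumptions of this sketch; it works on a Python list rather than a Spark column.

```python
from statistics import median

def null_strategy(values, threshold=0.10):
    """Decide how to treat one column: "keep", "drop", or "impute" with the median.

    `values` is a list in which missing entries are None (an assumption of this sketch).
    A null fraction of exactly `threshold` is treated as imputable here,
    a case the rule above leaves unspecified.
    """
    n_null = sum(v is None for v in values)
    if n_null == 0:
        return ("keep", None)
    if n_null / len(values) > threshold:
        return ("drop", None)
    observed = [v for v in values if v is not None]
    return ("impute", median(observed))
```

For a column with 1 null among 20 values, `null_strategy` returns `"impute"` together with the median of the 19 observed values.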

In [10]:
# Fraction of null values in "fav_number"
tweets.filter(tweets.fav_number.isNull()).count()/tweets.count()

0.0

In [11]:
# Fraction of null values in "retweet_count"
tweets.filter(tweets.retweet_count.isNull()).count()/tweets.count()

0.0

In [12]:
# Fraction of null values in "tweet_count"
tweets.filter(tweets.tweet_count.isNull()).count()/tweets.count()

0.07093757043047104

Since about 7.1% (less than 10%) of `tweet_count` values are null, we will replace them with the median value of `tweet_count` during pipeline building.

### Converting Columns to Type Integer

In [13]:
tweets = tweets.withColumn("fav_number",tweets["fav_number"].cast('integer'))
tweets = tweets.withColumn("retweet_count",tweets["retweet_count"].cast('integer'))
tweets = tweets.withColumn("tweet_count",tweets["tweet_count"].cast('integer'))

### Creating New Columns for Punctuation Counts, Emoji Existence, and Account Years

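Getting punctuation counts for each tweet. Note that `str(tweet)` below is applied to the whole `Row`, so the punctuation of the `Row(text='...')` wrapper is included in each count.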

In [14]:
from string import punctuation
from collections import Counter
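def punc_func(tweet):
    # tweet is a Counter mapping each character to its number of occurrences
    total = 0  # avoid shadowing the built-in sum()
    for key in tweet:
        if key in punctuation:
            total += tweet[key]
    return total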

In [15]:
punc = tweets.select("text").rdd.map(lambda tweet: Counter(str(tweet))) \
            .map(lambda x: punc_func(x)).map(lambda x: (x, )).toDF(['punc'])
punc.show(5)

+----+
|punc|
+----+
|  11|
|  18|
|   5|
|  19|
|  15|
+----+
only showing top 5 rows
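
As a cross-check, the same count can be computed directly on a tweet string in plain Python. `count_punctuation` is a hypothetical helper, not part of the notebook, and the `Row(text=...)` string is built by hand here to mimic what `str()` of a Spark `Row` looks like.

```python
from string import punctuation

PUNCT = set(punctuation)

def count_punctuation(text):
    """Count the characters of `text` that appear in string.punctuation."""
    return sum(ch in PUNCT for ch in text)

# First tweet of the dataset, as displayed in the lowercased output of this notebook.
first_tweet = ("robbie e responds to critics after win against eddie edwards "
               "in the #worldtitleseries https://t.co/nsybbmvjkz")

print(count_punctuation(first_tweet))                       # tweet text only
print(count_punctuation(f"Row(text={first_tweet!r})"))      # text plus the Row wrapper
```

The difference between the two printed values is the contribution of the wrapper characters `(`, `=`, the two quotes, and `)`.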



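Getting a boolean flag for whether a tweet contains the character '�' (True) or not (False). We use '�' as a proxy for emoji. It is the Unicode replacement character (U+FFFD), shown where the original characters could not be decoded, so it may also flag other non-ASCII characters such as curly quotes.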

In [16]:
emojis = tweets.select("text").rdd.map(lambda tweet: str(tweet)) \
            .map(lambda x: '�' in x).map(lambda x: (x, )).toDF(['emojis'])
emojis.show(5)

+------+
|emojis|
+------+
| false|
|  true|
| false|
| false|
|  true|
+------+
only showing top 5 rows
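
The check itself is a simple substring test. The sketch below, in plain Python with the hypothetical helper `has_undecodable_char`, also shows one way replacement characters can arise: decoding UTF-8 bytes with a codec that cannot represent them. Whether this is how they entered this dataset is an assumption.

```python
REPLACEMENT_CHAR = "\ufffd"  # the character rendered as '�'

def has_undecodable_char(text):
    """Return True if `text` contains the Unicode replacement character."""
    return REPLACEMENT_CHAR in text

# An emoji encoded as UTF-8 and then decoded as ASCII with errors="replace":
# every byte that ASCII cannot represent becomes one replacement character.
emoji_bytes = "nice \U0001F600".encode("utf-8")
garbled = emoji_bytes.decode("ascii", errors="replace")

print(has_undecodable_char(garbled))       # True
print(has_undecodable_char("plain text"))  # False
```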



Getting the number of years each Twitter user has had their account.

In [17]:
# extracting only the "year" part of 'created' & subtracting it from 15 (representing 2015)
years = tweets.rdd.map(lambda row: row['created'].split(' ')) \
                .map(lambda x: x[0]) \
                .map(lambda x: x.split('/')) \
                .map(lambda x: x[2]) \
                .map(lambda x: (15-int(x))) \
                .map(lambda x: (x, )).toDF(['account_years'])
years.show(5)

+-------------+
|account_years|
+-------------+
|            2|
|            3|
|            1|
|            6|
|            1|
+-------------+
only showing top 5 rows
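
The same computation can be written for a single value with `datetime.strptime`, which makes the expected date layout explicit. This is a plain-Python sketch; the helper name `account_years` and the format string `%m/%d/%y %H:%M` are assumptions based on the `created` values shown earlier.

```python
from datetime import datetime

def account_years(created, reference_year=2015):
    """Years between the account creation year and `reference_year`.

    Raises ValueError if `created` does not match the assumed layout.
    """
    created_at = datetime.strptime(created, "%m/%d/%y %H:%M")
    return reference_year - created_at.year

print(account_years("12/5/13 1:48"))   # 2
print(account_years("6/11/09 22:39"))  # 6
```

Unlike splitting on `/`, this version fails loudly on malformed dates instead of silently returning a wrong number.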



### Lowercasing Tweets

The last step before we configure our first pipeline is to lowercase the tweets in the `text` column.

In [18]:
from pyspark.sql.functions import lower, col
tweets = tweets.withColumn("text",lower(col('text')))
tweets.select("text").show(1, truncate=False)

+-------------------------------------------------------------------------------------------------------------+
|text                                                                                                         |
+-------------------------------------------------------------------------------------------------------------+
|robbie e responds to critics after win against eddie edwards in the #worldtitleseries https://t.co/nsybbmvjkz|
+-------------------------------------------------------------------------------------------------------------+
only showing top 1 row



# Pipeline Building

In [19]:
from pyspark.ml import Pipeline  
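from pyspark.ml.feature import (Imputer, StringIndexer, Tokenizer, StopWordsRemover,
                                HashingTF, NGram, VectorAssembler, StandardScaler)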

### Pipeline 1
For our first pipeline we:
- impute median values in `tweet_count`
- convert `gender` to numeric type
- tokenize and remove stop words in `text`

In [20]:
# impute median to null values in "tweet_count"
imputer = Imputer(inputCols=["tweet_count"], outputCols=["tweet_count_new"]).setStrategy("median")

# convert "gender" to numeric type; StringIndexer orders labels by descending frequency (0 = female, 1 = male, 2 = brand)
indexer = StringIndexer(inputCol="gender", outputCol="gender_num")

# process "text"
tokenizer = Tokenizer(inputCol="text", outputCol="words") #tokenize
remover = StopWordsRemover(inputCol="words", outputCol="filtered") #remove stop words

# build pipeline
pipeline = Pipeline(stages=[imputer, indexer, tokenizer, remover])

# fit & transform pipeline
tweets = pipeline.fit(tweets).transform(tweets)
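
To make the text-processing stages concrete, here is a plain-Python sketch of tokenizing and stop-word removal on one tweet. The helpers and the tiny `STOP_WORDS` set are hypothetical. Spark's `StopWordsRemover` uses a much longer default English list, and `str.split()` collapses repeated whitespace, which may differ from Spark's `Tokenizer`.

```python
# A tiny, hypothetical stop-word list; Spark's default English list is much longer.
STOP_WORDS = {"to", "after", "against", "in", "the"}

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in `stop_words`."""
    return [t for t in tokens if t not in stop_words]

tweet = ("robbie e responds to critics after win against eddie edwards "
         "in the #worldtitleseries https://t.co/nsybbmvjkz")
print(remove_stop_words(tokenize(tweet)))
```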

Since Spark's ML feature library (`pyspark.ml.feature`) does not provide a stemmer, we stem the tokens in the `filtered` column using NLTK's `SnowballStemmer`.

In [21]:
from pyspark.sql.functions import udf
from nltk.stem.snowball import SnowballStemmer
from pyspark.sql.types import ArrayType, StringType

stemmer = SnowballStemmer(language='english')
stemmer_udf = udf(lambda tokens: [stemmer.stem(token) for token in tokens], ArrayType(StringType()))
stemmed = tweets.withColumn("stemmed", stemmer_udf("filtered")).select("stemmed")
stemmed.show(1, truncate=False)

+----------------------------------------------------------------------------------------+
|stemmed                                                                                 |
+----------------------------------------------------------------------------------------+
|[robbi, e, respond, critic, win, eddi, edward, #worldtitleseri, https://t.co/nsybbmvjkz]|
+----------------------------------------------------------------------------------------+
only showing top 1 row



### Adding Columns to `tweets`
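We add `stemmed`, `punc`, `emojis`, and `account_years` to our updated dataset `tweets` from pipeline 1. Because these columns were built as separate DataFrames, we attach a `row_index` to each and join on it. This assumes every DataFrame preserves the original row order.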

In [22]:
from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window

In [23]:
tweets_new = tweets.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))
punc_new = punc.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))
emojis_new = emojis.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))
years_new = years.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))
stemmed_new = stemmed.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))

In [24]:
tweets = tweets_new.join(punc_new, on=["row_index"]).join(emojis_new, on=["row_index"]) \
            .join(years_new, on=["row_index"]).join(stemmed_new, on=["row_index"])

In [25]:
tweets.columns

['row_index',
 'gender',
 'created',
 'fav_number',
 'retweet_count',
 'text',
 'tweet_count',
 'tweet_created',
 'tweet_count_new',
 'gender_num',
 'words',
 'filtered',
 'punc',
 'emojis',
 'account_years',
 'stemmed']

### Pipeline 2
For our second pipeline we:
- compute term frequency vector for `stemmed`
- create `bigrams` from `stemmed`
- compute term frequency vector for `bigrams`
- combine all relevant columns into `features` using VectorAssembler
- scale `features`
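
The first three steps can be illustrated in plain Python. This is only a sketch of the idea: the helper names are hypothetical, and `zlib.crc32` is used as a stand-in hash, so the indices will not match the ones Spark's `HashingTF` produces.

```python
import zlib
from collections import Counter

def bigrams(tokens):
    """Join each pair of adjacent tokens with a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def hashed_tf(tokens, num_features=10000):
    """Map each token to a bucket in [0, num_features) and count occurrences.

    Returns a sparse {bucket_index: count} dict. Distinct tokens can collide
    in the same bucket, which is the trade-off of the hashing trick.
    """
    counts = Counter()
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        counts[idx] += 1
    return dict(counts)

stemmed = ["robbi", "e", "respond", "critic", "win"]
print(bigrams(stemmed))
print(hashed_tf(stemmed))
print(hashed_tf(bigrams(stemmed)))
```

A fixed `num_features` keeps the vector length constant regardless of vocabulary size, at the cost of occasional collisions.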

In [26]:
htf = HashingTF(inputCol="stemmed", outputCol="tf", numFeatures=10000) #term frequencies for stemmed

bigrams = NGram(n=2, inputCol="stemmed", outputCol="bigrams") #create bigrams
bhtf = HashingTF(inputCol="bigrams", outputCol="btf", numFeatures=10000) #term frequencies for bigrams

# create features column
va = VectorAssembler(inputCols=["tf", "btf", "punc", "emojis", "tweet_count_new", "fav_number", "retweet_count",
                               "account_years"], outputCol="features")

# scale features
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")

# build pipeline
pipeline = Pipeline(stages=[htf, bigrams, bhtf, va, scaler])

# fit & transform pipeline
tweets = pipeline.fit(tweets).transform(tweets)
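
With its default settings, `StandardScaler` divides each feature by its standard deviation and does not subtract the mean, which keeps sparse vectors sparse. A plain-Python sketch of that scaling on a small dense matrix follows; the helper `scale_to_unit_std` and the sample rows are hypothetical.

```python
from statistics import stdev

def scale_to_unit_std(rows):
    """Divide every column by its sample standard deviation (no centering).

    Columns with zero standard deviation are mapped to 0.0.
    """
    columns = list(zip(*rows))
    stds = [stdev(col) for col in columns]
    return [[x / s if s > 0 else 0.0 for x, s in zip(row, stds)] for row in rows]

rows = [[1, 2, 5],
        [2, 4, 5],
        [3, 6, 5]]
print(scale_to_unit_std(rows))
```

Scaling puts large-valued columns such as `fav_number` on a comparable footing with the term counts.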

In [27]:
tweets.columns

['row_index',
 'gender',
 'created',
 'fav_number',
 'retweet_count',
 'text',
 'tweet_count',
 'tweet_created',
 'tweet_count_new',
 'gender_num',
 'words',
 'filtered',
 'punc',
 'emojis',
 'account_years',
 'stemmed',
 'tf',
 'bigrams',
 'btf',
 'features',
 'scaledFeatures']

### Final Dataframe
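For modeling, we use only `scaledFeatures` (renamed `features`) and `gender_num` (renamed `label`). Each feature vector has 20,006 entries: 10,000 hashed unigram counts, 10,000 hashed bigram counts, and the six scalar columns.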

In [28]:
final = tweets.select("scaledFeatures", "gender_num") \
        .withColumnRenamed('gender_num', 'label').withColumnRenamed('scaledFeatures', 'features')
final.cache()

DataFrame[features: vector, label: double]

In [29]:
final.show(5)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|(20006,[2398,2500...|  1.0|
|(20006,[472,488,4...|  1.0|
|(20006,[2196,2996...|  1.0|
|(20006,[62,375,11...|  1.0|
|(20006,[1927,2708...|  0.0|
+--------------------+-----+
only showing top 5 rows



# Splitting the Dataset

We split the `final` dataset into roughly 70% training and 30% test data. Because `randomSplit` is called without a seed, the exact counts vary between runs.

In [30]:
(training, test) = final.randomSplit([0.7,0.3])
print("Count of training set = " + str(training.count()))
print("Count of test set = " + str(test.count()))

Count of training set = 12522
Count of test set = 5226
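
A random split assigns each row independently, so the proportions are only approximately 70/30, and fixing a seed makes the assignment repeatable. A plain-Python sketch of this idea follows; the helper `random_split` and the seed value are hypothetical, and the sampling details differ from Spark's implementation.

```python
import random

def random_split(rows, train_fraction=0.7, seed=42):
    """Assign each row to train with probability `train_fraction`, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))
train, test = random_split(rows)
print(len(train), len(test))
```

In Spark, `DataFrame.randomSplit` accepts an optional `seed` argument for the same purpose, for example `final.randomSplit([0.7, 0.3], seed=42)`.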


# Model Construction and Evaluation

### Multinomial Logistic Regression

In [31]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
model = lr.fit(training)
predictions_lr = model.transform(test)

In [32]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="label",predictionCol="prediction", metricName="accuracy")
accuracy_lr = evaluator.evaluate(predictions_lr)
print("Accuracy = " + str(accuracy_lr))

Accuracy = 0.36203597397627246
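
To put an accuracy score in context, it helps to compare it with a baseline that always predicts the most frequent class. With three classes of similar size, that baseline sits a little above one third. A plain-Python sketch with hypothetical helper names and toy labels:

```python
from collections import Counter

def accuracy(labels, predictions):
    """Fraction of predictions that equal the true label."""
    correct = sum(l == p for l, p in zip(labels, predictions))
    return correct / len(labels)

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

labels = [0, 0, 1, 2]
predictions = [0, 1, 1, 2]
print(accuracy(labels, predictions))       # 0.75
print(majority_baseline_accuracy(labels))  # 0.5
```

A model whose accuracy is close to the baseline may be predicting a single class for most rows. Inspecting the distribution of the `prediction` column is one way to check.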


### Random Forest

In [33]:
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=10)
model = rf.fit(training)
predictions_rf = model.transform(test)

In [34]:
accuracy_rf = evaluator.evaluate(predictions_rf)
print("Accuracy = " + str(accuracy_rf))

Accuracy = 0.4393417527745886


### Naive Bayes

In [35]:
from pyspark.ml.classification import NaiveBayes

nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
model = nb.fit(training)
predictions_nb = model.transform(test)

In [36]:
accuracy_nb = evaluator.evaluate(predictions_nb)
print("Accuracy = " + str(accuracy_nb))

Accuracy = 0.4364714887102947


### K-Means

In [None]:
from pyspark.ml.clustering import KMeans

kmeans = KMeans().setK(3).setSeed(1)
model = kmeans.fit(training)
predictions_km = model.transform(test)

In [None]:
from pyspark.ml.evaluation import ClusteringEvaluator

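# use a separate name so the classification evaluator defined earlier is not overwritten
clustering_evaluator = ClusteringEvaluator()
silhouette = clustering_evaluator.evaluate(predictions_km)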
print("Silhouette with squared euclidean distance = " + str(silhouette))

# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

## K-Fold Cross-Validation