## Twitter Gender Classification
### Final Project

#### University of California, Santa Barbara
#### PSTAT 135: Big Data Analytics

Source: https://www.kaggle.com/crowdflower/twitter-user-gender-classification

### Dataset Preprocessing

The dataset contains about 20,000 rows, each with a user name, a random tweet, account profile and image, location, and link and sidebar color. All tweets were posted on October 26, 2015.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession \
        .builder \
        .appName("comm") \
        .getOrCreate()

tweets = spark.read.csv('gender_data.csv', header = True)

In [2]:
type(tweets)

pyspark.sql.dataframe.DataFrame

In [3]:
tweets.count()

24230

In [4]:
tweets.columns

['_unit_id',
 '_golden',
 '_unit_state',
 '_trusted_judgments',
 '_last_judgment_at',
 'gender',
 'gender:confidence',
 'profile_yn',
 'profile_yn:confidence',
 'created',
 'description',
 'fav_number',
 'gender_gold',
 'link_color',
 'name',
 'profile_yn_gold',
 'profileimage',
 'retweet_count',
 'sidebar_color',
 'text',
 'tweet_coord',
 'tweet_count',
 'tweet_created',
 'tweet_id',
 'tweet_location',
 'user_timezone']

**We removed columns `_golden`, `_unit_state`, `_last_judgment_at`, `gender:confidence`, `profile_yn`, `profile_yn:confidence`, `link_color`, `profile_yn_gold`, `profileimage`, and `sidebar_color` because they were not relevant in our model/purpose.**

In [5]:
columns_to_drop = ['_unit_id','_golden','_unit_state','_trusted_judgments','_last_judgment_at',
                   'gender:confidence','profile_yn','profile_yn:confidence','gender_gold',
                   'profile_yn_gold','link_color','profileimage','sidebar_color']
tweets = tweets.drop(*columns_to_drop)
tweets.columns

['gender',
 'created',
 'description',
 'fav_number',
 'name',
 'retweet_count',
 'text',
 'tweet_coord',
 'tweet_count',
 'tweet_created',
 'tweet_id',
 'tweet_location',
 'user_timezone']

**In the `gender` column, we found that there were other labels besides `male`, `female`, and `brand`. Therefore, we are only going to keep rows where they have 1 of these 3 labels.**

In [6]:
tweets = tweets.filter((tweets.gender == 'male') | (tweets.gender == 'female') | (tweets.gender == 'brand'))

In [7]:
tweets.count()

18836

**From above, we see that after filtering gender, we have 18,836 rows. Next, we will check to see that each row has non-empty or non-null values in column `text`.**

In [8]:
# Number of non-null values in "text"
tweets.filter(tweets.text.isNotNull()).count()

17748

We see that we have 17,748 non-null values, so there is a presence of null values in this column. Thus, we will remove rows with null values, since actual text is important in our model.

In [9]:
tweets = tweets.filter(tweets.text.isNotNull())

**For columns `fav_number`, `retweet_count`, and `tweet_count`, we will check to see if there are null values.** 

- If more than 10% of the values are null, we will not include that column later on in our model. 

- If less than 10% of the values are null, we will replace null values with the median of that specific column. We chose median because there is a possibility the mean could be something like 77.5, and you cannot retweet something 77.5 times.

`fav_number` : Since there are no null values, there is no need to do any replacing.

In [10]:
# Number of null values in "fav_number"
tweets.filter(tweets.fav_number.isNull()).count()

0

`retweet_count` : Since there are no null values, there is no need to do any replacing. 

In [11]:
# Number of null values in "retweet_count"
tweets.filter(tweets.retweet_count.isNull()).count()

0

`tweet_count` : There appears to be 1,259 null values. Since null values make up less than 10% of this column, we will replace these null values with the median of this column.

In [12]:
# Number of null values in "tweet_count"
tweets.filter(tweets.tweet_count.isNull()).count()

1259

We applied **Imputer** because it imputes missing values using mean (the default) or median in columns where missing values are located.

In [13]:
# Median of "tweet_count"
from pyspark.ml.feature import Imputer
from pyspark.sql.types import IntegerType

tweets = tweets.withColumn("tweet_count",tweets["tweet_count"].cast(IntegerType()))
imputer = Imputer(inputCols=["tweet_count"], outputCols=["tweet_count_new"]).setStrategy("median")
tweets = imputer.fit(tweets).transform(tweets)

Now we will check to see if there are any null values left in the column.

In [14]:
tweets.filter(tweets.tweet_count_new.isNull()).count()

0

**We are interested in getting the number of years the twitter user has had there account, up to the date of the posted tweet we have in our data. We will create a row with this count of years and call it `account_years`.**

In [15]:
tweets.select('created').show(3)

+--------------+
|       created|
+--------------+
|  12/5/13 1:48|
| 10/1/12 13:51|
|11/28/14 11:30|
+--------------+
only showing top 3 rows



In [16]:
# Use rdd, extract only the "year" part of the date, subtract from "15" (representing 2015)
rdd = tweets.rdd
years = rdd.map(lambda row: row['created'].split(' ')) \
                .map(lambda x: x[0]) \
                .map(lambda x: x.split('/')) \
                .map(lambda x: x[2]) \
                .map(lambda x: (15-int(x))) \
                .map(lambda x: (x, )).toDF()
years.show(3)

+---+
| _1|
+---+
|  2|
|  3|
|  1|
+---+
only showing top 3 rows



In [17]:
# Convert back to dataframe, add calculated years to "tweets" as "account_years"
tweets = tweets.join(years).withColumnRenamed('_1', 'account_years')

### Exploratory Data Analysis (EDA)

By Gender: top 10 most used words, average number of favorites (`fav_number`), average number of retweets (`retween_count`), average number of tweets (`tweet_count`), average number of account years


# CHECK ACCOUNT YEARS AND TOP 10

**Female**

In [18]:
# Top 10 most used words
#female = tweets.filter(tweets.gender == "female")
#words = female.rdd.flatMap(lambda row: row['text'].split(' '))
#wordcounts = words.map(lambda x: (x, 1)) \
                  #.reduceByKey(lambda x,y:x+y) \
                  #.map(lambda x:(x[1],x[0])) \
                  #.sortByKey(False)
#wordcounts.take(10)

In [19]:
# fav_number average
tweets.filter(tweets.gender == "female").agg({'fav_number':'avg'}).show()

+-----------------+
|  avg(fav_number)|
+-----------------+
|6039.206956248033|
+-----------------+



In [20]:
# retweet_count average
tweets.filter(tweets.gender == "female").agg({'retweet_count':'avg'}).show()

+-------------------+
| avg(retweet_count)|
+-------------------+
|0.04737173434057287|
+-------------------+



In [21]:
# tweet_count average
tweets.filter(tweets.gender == "female").agg({'tweet_count':'avg'}).show()

+-----------------+
| avg(tweet_count)|
+-----------------+
|26868.96356206368|
+-----------------+



In [32]:
# account_years average
tweets.filter(tweets.gender == "female").agg({'account_years':'avg'}).show()

+------------------+
|avg(account_years)|
+------------------+
|2.8274735181428894|
+------------------+



**Male**

In [None]:
# Top 10 most used words
#female = tweets.filter(tweets.gender == "female")
#words = female.rdd.flatMap(lambda row: row['text'].split(' '))
#wordcounts = words.map(lambda x: (x, 1)) \
                  #.reduceByKey(lambda x,y:x+y) \
                  #.map(lambda x:(x[1],x[0])) \
                  #.sortByKey(False)
#wordcounts.take(10)

In [26]:
# fav_number average
tweets.filter(tweets.gender == "male").agg({'fav_number':'avg'}).show()

+-----------------+
|  avg(fav_number)|
+-----------------+
|4929.307825484764|
+-----------------+



In [27]:
# retweet_count average
tweets.filter(tweets.gender == "male").agg({'retweet_count':'avg'}).show()

+-------------------+
| avg(retweet_count)|
+-------------------+
|0.09279778393351801|
+-------------------+



In [28]:
# tweet_count average
tweets.filter(tweets.gender == "male").agg({'tweet_count':'avg'}).show()

+-----------------+
| avg(tweet_count)|
+-----------------+
|30534.52625698324|
+-----------------+



In [31]:
# account_years average
tweets.filter(tweets.gender == "male").agg({'account_years':'avg'}).show()

+------------------+
|avg(account_years)|
+------------------+
|2.8274735181428894|
+------------------+



### Natural Language Processing (NLP)

#### We derived features based on casing, punctuation, and emojis on the tweets in the `text` column.

**Casing**

In [22]:
# Counts of Lowercase (True), Uppercase (False)

words = rdd.flatMap(lambda row: row['text'].split(' '))
wordcounts = words.map(lambda x: (x, 0)) \
                    .map(lambda x: x[0].isupper())
wordcounts.countByValue()

defaultdict(int, {False: 252068, True: 12995})

In [23]:
12995/(12995+252068)

0.04902608059216111

There are 12,995 upper-case words in the `text` column. As you can see above, upper-case words make up less than 5% of the total words, so we do not think it is a heavily-weighted factor in deciding gender.

**Punctuation**

In [24]:
# Count punctuation (!, ?, .)
from string import punctuation
from collections import Counter

counts = Counter(str(tweets.select('text').take(17748)))
punctuation_counts = {k:v for k, v in counts.items() if k in punctuation}
punctuation_counts

{'[': 96,
 '(': 18409,
 '=': 17774,
 "'": 32906,
 '#': 5809,
 ':': 11900,
 '/': 21656,
 '.': 21515,
 ')': 18566,
 ',': 22674,
 '\\': 430,
 '"': 13310,
 '@': 7858,
 '-': 2266,
 '?': 1707,
 '!': 3546,
 '+': 129,
 '_': 7190,
 '%': 106,
 '&': 679,
 ';': 811,
 ']': 95,
 '|': 132,
 '$': 205,
 '~': 52,
 '*': 291,
 '^': 26,
 '{': 11,
 '}': 8,
 '`': 5}

**Emojis**

In [25]:
# Count emojis
import emoji

emojis_list = tweets.rdd.map(lambda x: ''.join(x.split()), emoji.UNICODE_EMOJI.keys()).collect()

counts = Counter(str(tweets.select('text').take(17748)))
emoji_counts = {k:v for k, v in counts.items() if k in emojis_list}

ModuleNotFoundError: No module named 'emoji'

In [None]:
import unicodedata
[w for w in tweets.select('text').take(1) if any(c for c in w if unicodedata.category(c) == 'So')]

#### Now we will process the tweets. We can monitor our changes by looking at row 1 of `text`

**Casing**

In [None]:
tweets.select("text").show(1, truncate=False)

In [None]:
# Lowercase words
from pyspark.sql.functions import lower, col
tweets = tweets.withColumn("text",lower(col('text')))
tweets.select("text").show(1, truncate=False)

**Tokenize**

In [None]:
# Tokenizer
from pyspark.ml.feature import Tokenizer
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tweets = tokenizer.transform(tweets)
tweets.select("words").show(1, truncate=False)

**Stop Word Removal**

In [None]:
# Stop Word Removal
from pyspark.ml.feature import StopWordsRemover
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
tweets = remover.transform(tweets)
tweets.select("filtered").show(1, truncate=False)

**Lemmatization**

In [None]:
from pyspark.ml.feature import HashingTF, IDF, StringIndexer, SQLTransformer,IndexToString
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp

stemmer = Stemmer() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCol("stem")