## Twitter Gender Classification
### Final Project

#### University of California, Santa Barbara
#### PSTAT 135: Big Data Analytics

Source: https://www.kaggle.com/crowdflower/twitter-user-gender-classification

### Dataset Preprocessing

The dataset contains about 20,000 rows, each with a user name, a random tweet, account profile and image, location, and link and sidebar color. All tweets were posted on October 26, 2015.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession \
        .builder \
        .appName("comm") \
        .getOrCreate()

tweets = spark.read.csv('gender_data.csv', header = True)

In [2]:
type(tweets)

pyspark.sql.dataframe.DataFrame

In [3]:
tweets.count()

24230

In [4]:
tweets.columns

['_unit_id',
 '_golden',
 '_unit_state',
 '_trusted_judgments',
 '_last_judgment_at',
 'gender',
 'gender:confidence',
 'profile_yn',
 'profile_yn:confidence',
 'created',
 'description',
 'fav_number',
 'gender_gold',
 'link_color',
 'name',
 'profile_yn_gold',
 'profileimage',
 'retweet_count',
 'sidebar_color',
 'text',
 'tweet_coord',
 'tweet_count',
 'tweet_created',
 'tweet_id',
 'tweet_location',
 'user_timezone']

**We removed columns `_golden`, `_unit_state`, `_last_judgment_at`, `gender:confidence`, `profile_yn`, `profile_yn:confidence`, `link_color`, `profile_yn_gold`, `profileimage`, and `sidebar_color` because they were not relevant in our model/purpose.**

In [5]:
columns_to_drop = ['_unit_id','_golden','_unit_state','_trusted_judgments','_last_judgment_at',
                   'gender:confidence','profile_yn','profile_yn:confidence','gender_gold',
                   'profile_yn_gold','link_color','profileimage','sidebar_color']
tweets = tweets.drop(*columns_to_drop)
tweets.columns

['gender',
 'created',
 'description',
 'fav_number',
 'name',
 'retweet_count',
 'text',
 'tweet_coord',
 'tweet_count',
 'tweet_created',
 'tweet_id',
 'tweet_location',
 'user_timezone']

**In the `gender` column, we found that there were other labels besides `male`, `female`, and `brand`. Therefore, we are only going to keep rows where they have 1 of these 3 labels.**

In [6]:
tweets = tweets.filter((tweets.gender == 'male') | (tweets.gender == 'female') | (tweets.gender == 'brand'))

In [7]:
tweets.count()

18836

**From above, we see that after filtering gender, we have 18,836 rows. Next, we will check to see that each row has non-empty or non-null values in column `text`.**

In [8]:
# Number of non-null values in "text"
tweets.filter(tweets.text.isNotNull()).count()

17748

We see that we have 17,748 non-null values, so there is a presence of null values in this column. Thus, we will remove rows with null values, since actual text is important in our model.

In [9]:
tweets = tweets.filter(tweets.text.isNotNull())

**For columns `fav_number`, `retweet_count`, and `tweet_count`, we will check to see if there are null values.** 

- If more than 10% of the values are null, we will not include that column later on in our model. 

- If less than 10% of the values are null, we will replace null values with the median of that specific column. We chose median because there is a possibility the mean could be something like 77.5, and you cannot retweet something 77.5 times.

`fav_number` : Since there are no null values, there is no need to do any replacing.

In [10]:
# Number of null values in "fav_number"
tweets.filter(tweets.fav_number.isNull()).count()

0

`retweet_count` : Since there are no null values, there is no need to do any replacing. 

In [11]:
# Number of null values in "retweet_count"
tweets.filter(tweets.retweet_count.isNull()).count()

0

`tweet_count` : There appears to be 1,259 null values. Since null values make up less than 10% of this column, we will replace these null values with the median of this column.

In [12]:
# Number of null values in "tweet_count"
tweets.filter(tweets.tweet_count.isNull()).count()

1259

We applied **Imputer** because it imputes missing values using mean (the default) or median in columns where missing values are located.

In [13]:
# Median of "tweet_count"
from pyspark.ml.feature import Imputer
from pyspark.sql.types import IntegerType

tweets = tweets.withColumn("tweet_count",tweets["tweet_count"].cast(IntegerType()))
imputer = Imputer(inputCols=["tweet_count"], outputCols=["tweet_count_new"]).setStrategy("median")
tweets = imputer.fit(tweets).transform(tweets)

Now we will check to see if there are any null values left in the column.

In [14]:
tweets.filter(tweets.tweet_count_new.isNull()).count()

0

**We are interested in getting the number of years the twitter user has had there account, up to the date of the posted tweet we have in our data. We will create a row with this count of years and call it `account_years`.**

In [15]:
tweets.select('created').show(3)

+--------------+
|       created|
+--------------+
|  12/5/13 1:48|
| 10/1/12 13:51|
|11/28/14 11:30|
+--------------+
only showing top 3 rows



In [16]:
# Use rdd, extract only the "year" part of the date, subtract from "15" (representing 2015)
rdd = tweets.rdd
years = rdd.map(lambda row: row['created'].split(' ')) \
                .map(lambda x: x[0]) \
                .map(lambda x: x.split('/')) \
                .map(lambda x: x[2]) \
                .map(lambda x: (15-int(x))) \
                .map(lambda x: (x, )).toDF()
years.show(3)

+---+
| _1|
+---+
|  2|
|  3|
|  1|
+---+
only showing top 3 rows



In [17]:
# Convert back to dataframe, add calculated years to "tweets" as "account_years"
tweets = tweets.join(years).withColumnRenamed('_1', 'account_years')

### Exploratory Data Analysis (EDA)

Checking the counts and frequencies, average number of favorites (`fav_number`), average number of retweets (`retween_count`), average number of tweets (`tweet_count`), and average number of account years for each `gender` output.

In [24]:
# counts
tweets.groupBy('gender').count().orderBy('count',ascending=False).show()

+------+---------+
|gender|    count|
+------+---------+
|female|112770792|
|  male|102512448|
| brand| 99708264|
+------+---------+



In [25]:
# brand averages
tweets.filter(tweets.gender == "brand") \
            .agg({'fav_number':'avg','retweet_count':'avg','tweet_count':'avg', 'account_years':'avg'}).show()

+----------------+------------------+-----------------+-------------------+
|avg(tweet_count)|avg(account_years)|  avg(fav_number)| avg(retweet_count)|
+----------------+------------------+-----------------+-------------------+
|63219.3443287037|2.8274735181428894|2086.142043431826|0.11765752936988252|
+----------------+------------------+-----------------+-------------------+



In [26]:
# female averages
tweets.filter(tweets.gender == "female") \
            .agg({'fav_number':'avg','retweet_count':'avg','tweet_count':'avg', 'account_years':'avg'}).show()

+-----------------+------------------+-----------------+-------------------+
| avg(tweet_count)|avg(account_years)|  avg(fav_number)| avg(retweet_count)|
+-----------------+------------------+-----------------+-------------------+
|26868.96356206368|2.8274735181428894|6039.206956248033|0.04737173434057287|
+-----------------+------------------+-----------------+-------------------+



In [27]:
# male averages
tweets.filter(tweets.gender == "male") \
            .agg({'fav_number':'avg','retweet_count':'avg','tweet_count':'avg', 'account_years':'avg'}).show()

+-----------------+------------------+-----------------+-------------------+
| avg(tweet_count)|avg(account_years)|  avg(fav_number)| avg(retweet_count)|
+-----------------+------------------+-----------------+-------------------+
|30534.52625698324|2.8274735181428894|4929.307825484764|0.09279778393351801|
+-----------------+------------------+-----------------+-------------------+



### Natural Language Processing (NLP)

#### We derived features based on casing, punctuation, and emojis on the tweets in the `text` column.

**Casing**

In [18]:
# Counts of Lowercase (True), Uppercase (False)

words = rdd.flatMap(lambda row: row['text'].split(' '))
wordcounts = words.map(lambda x: (x, 0)) \
                    .map(lambda x: x[0].isupper())
wordcounts.countByValue()

defaultdict(int, {False: 252068, True: 12995})

In [19]:
12995/(12995+252068)

0.04902608059216111

There are 12,995 upper-case words in the `text` column. As you can see above, upper-case words make up less than 5% of the total words, so we do not think it is a heavily-weighted factor in deciding gender.

**Punctuation**

In [20]:
# Count punctuation (!, ?, .)
from string import punctuation
from collections import Counter

counts = Counter(str(tweets.select('text').take(17748)))
punctuation_counts = {k:v for k, v in counts.items() if k in punctuation}
punctuation_counts

{'[': 96,
 '(': 18409,
 '=': 17774,
 "'": 32906,
 '#': 5809,
 ':': 11900,
 '/': 21656,
 '.': 21515,
 ')': 18566,
 ',': 22674,
 '\\': 430,
 '"': 13310,
 '@': 7858,
 '-': 2266,
 '?': 1707,
 '!': 3546,
 '+': 129,
 '_': 7190,
 '%': 106,
 '&': 679,
 ';': 811,
 ']': 95,
 '|': 132,
 '$': 205,
 '~': 52,
 '*': 291,
 '^': 26,
 '{': 11,
 '}': 8,
 '`': 5}

**Emojis**

In [None]:
# Count emojis
import re

tweets.createOrReplaceTempView("table1")
df2 = spark.sql("SELECT COUNT(text) FROM table1 WHERE text LIKE UNICODE(N'FFFD') ")
df2.show(5) 

In [None]:
import unicodedata
[w for w in tweets.select('text').take(1) if any(c for c in w if unicodedata.category(c) == 'So')]

#### Now we will process the tweets. We can monitor our changes by looking at row 1 of `text`

**Casing**

In [21]:
tweets.select("text").show(1, truncate=False)

+-------------------------------------------------------------------------------------------------------------+
|text                                                                                                         |
+-------------------------------------------------------------------------------------------------------------+
|Robbie E Responds To Critics After Win Against Eddie Edwards In The #WorldTitleSeries https://t.co/NSybBmVjKZ|
+-------------------------------------------------------------------------------------------------------------+
only showing top 1 row



In [22]:
# Lowercase words
from pyspark.sql.functions import lower, col
tweets = tweets.withColumn("text",lower(col('text')))
tweets.select("text").show(1, truncate=False)

+-------------------------------------------------------------------------------------------------------------+
|text                                                                                                         |
+-------------------------------------------------------------------------------------------------------------+
|robbie e responds to critics after win against eddie edwards in the #worldtitleseries https://t.co/nsybbmvjkz|
+-------------------------------------------------------------------------------------------------------------+
only showing top 1 row



**Tokenize**

In [23]:
# Tokenizer
from pyspark.ml.feature import Tokenizer
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tweets = tokenizer.transform(tweets)
tweets.select("words").show(1, truncate=False)

+----------------------------------------------------------------------------------------------------------------------------+
|words                                                                                                                       |
+----------------------------------------------------------------------------------------------------------------------------+
|[robbie, e, responds, to, critics, after, win, against, eddie, edwards, in, the, #worldtitleseries, https://t.co/nsybbmvjkz]|
+----------------------------------------------------------------------------------------------------------------------------+
only showing top 1 row



**Stop Word Removal**

In [24]:
# Stop Word Removal
from pyspark.ml.feature import StopWordsRemover
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
tweets = remover.transform(tweets)
tweets.select("filtered").show(1, truncate=False)

+-----------------------------------------------------------------------------------------------+
|filtered                                                                                       |
+-----------------------------------------------------------------------------------------------+
|[robbie, e, responds, critics, win, eddie, edwards, #worldtitleseries, https://t.co/nsybbmvjkz]|
+-----------------------------------------------------------------------------------------------+
only showing top 1 row



**Stemming**

In [25]:
from pyspark.sql.functions import udf
from nltk.stem.snowball import SnowballStemmer
from pyspark.sql.types import ArrayType, StringType

stemmer = SnowballStemmer(language='english')
stemmer_udf = udf(lambda tokens: [stemmer.stem(token) for token in tokens], ArrayType(StringType()))
stem_tweets = tweets.withColumn("stemmed", stemmer_udf("filtered")).select("stemmed")

In [26]:
stem_tweets.columns

['stemmed']

#### Now we will apply Vector Assembler to prep the data for various ML models.

##### [[Counts for each word], [Bigrams], [Punctuation counts, ** Emoji Counts **], tweets counts, favorited tweets, retweet counts, account years]

##### HashingTF for stemmed tweet

In [27]:
from pyspark.mllib.feature import HashingTF
tf = HashingTF(numFeatures=10000)
stem_features = stem_tweets.rdd.map(lambda tweet: tf.transform(tweet[0]))

In [28]:
stem_features.take(2)

[SparseVector(10000, {162: 1.0, 1521: 1.0, 1785: 1.0, 4032: 1.0, 5505: 1.0, 5993: 1.0, 6493: 1.0, 7437: 1.0, 9022: 1.0}),
 SparseVector(10000, {213: 1.0, 1065: 1.0, 1074: 1.0, 1136: 1.0, 2547: 1.0, 4623: 1.0, 6992: 1.0, 7449: 1.0, 8194: 1.0, 8854: 1.0, 9546: 1.0})]

##### Adding columns for counts of punctuation

In [52]:
def punc_func(a):
    if a in punctuation:
        return true
    return false

punc = tweets.select("text").rdd.map(lambda tweet: Counter(str(tweet)))
punc_2 = punc.map(lambda x: list(x.keys())).map(lambda y: 1 if y[0] in punctuation else None)
punc_2.take(5)

[None, None, None, None, None]

In [None]:
         #.map(lambda y: y)).take(5)
#punc_2 = punc.map(lambda x: list(x.keys())).map(lambda x: filter(punc_func, item) for item in x) 
         #lambda y: 1 if y in punctuation else None)
#punc_2.take(5)

In [None]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

#assembler = VectorAssembler(
#   inputCols=["text"],
#    outputCol="features")

#output = assembler.transform(tweets)
#output.select("features", "text").show(truncate=False)