### DS 5110 Final Project
#### Michael Kolonay (mhk9c)
#### Tyler Entner (tje6gt)


#### Dataset Variable Descriptions:

| Variable Name      | Description |
| ----------- | ----------- |
|**external_author_id**|	An author account ID from Twitter|
|**author**	|The handle sending the tweet
|**content**	|The text of the tweet
|**region**|	A region classification, as determined by Social Studio
|**language**|	The language of the tweet
|**publish_date**|	The date and time the tweet was sent
|**harvested_date**|	The date and time the tweet was collected by Social Studio
|**following**|	The number of accounts the handle was following at the time of the tweet
|**followers**|	The number of followers the handle had at the time of the tweet
|**updates**|	The number of “update actions” on the account that authored the tweet, including tweets, retweets and likes
|**post_type**|	Indicates if the tweet was a retweet or a quote-tweet
|**account_type**|	Specific account theme, as coded by Linvill and Warren
|**retweet**|	A binary indicator of whether or not the tweet is a retweet
|**account_category**|	General account theme, as coded by Linvill and Warren
|**new_june_2018**|	A binary indicator of whether the handle was newly listed in June 2018
|**alt_external_id**|	Reconstruction of author account ID from Twitter, derived from article_url variable and the first list provided to Congress
|**tweet_id**|	Unique ID assigned by Twitter to each status update, derived from article_url
|**article_url**|	Link to original tweet. Now redirects to "Account Suspended" page
|**tco1_step1**|	First redirect for the first http(s)://t.co/ link in a tweet, if it exists
|**tco2_step1**|	First redirect for the second http(s)://t.co/ link in a tweet, if it exists
|**tco3_step1**|	First redirect for the third http(s)://t.co/ link in a tweet, if it exists

## SECTION 1: Data Import and Preprocessing
#### Data Ingestion

First, we load all of the CSV files into a single DataFrame. Every file shares the same schema, so one read call covers the whole data directory.
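
Because a single read applies one schema to every file, a file with a different header would be read with the wrong column mapping without raising an error. The sketch below is one way to check headers before loading. It is not part of the original notebook: `find_schema_mismatches` and `EXPECTED_COLUMNS` are hypothetical names, and the check assumes plain UTF-8 CSV files with a header row.

```python
import csv
from pathlib import Path

# Column order taken from the variable table above (assumed to match the CSV headers).
EXPECTED_COLUMNS = [
    "external_author_id", "author", "content", "region", "language",
    "publish_date", "harvested_date", "following", "followers", "updates",
    "post_type", "account_type", "retweet", "account_category",
    "new_june_2018", "alt_external_id", "tweet_id", "article_url",
    "tco1_step1", "tco2_step1", "tco3_step1",
]


def find_schema_mismatches(data_directory, expected=EXPECTED_COLUMNS):
    """Return {filename: header} for every CSV whose header differs from `expected`."""
    mismatches = {}
    for path in sorted(Path(data_directory).glob("*.csv")):
        with open(path, newline="", encoding="utf-8") as handle:
            # Only the first row is read, so this stays cheap for large files.
            header = next(csv.reader(handle), [])
        if header != list(expected):
            mismatches[path.name] = header
    return mismatches
```

Calling `find_schema_mismatches("data/")` before the Spark read should return an empty dictionary when every file agrees with the schema.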

In [3]:
# import context manager: SparkSession
from pyspark.sql import SparkSession

# import data types
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

#Create session with custom app name, grab context
spark = SparkSession \
        .builder \
        .master("local") \
        .appName("team1_sp22_final_project") \
        .config("spark.executor.memory", '100g') \
        .config('spark.executor.cores', '20') \
        .config('spark.cores.max', '20') \
        .config("spark.driver.memory",'100g') \
        .getOrCreate()

sc = spark.sparkContext

In [2]:
# Set data directory
data_directory = "data/"

# Define custom schema of csv files
schema = StructType([StructField('external_author_id', StringType(), True), 
                     StructField('author', StringType(), True),
                     StructField('content', StringType(), True),
                     StructField('region', StringType(), True),
                     StructField('language', StringType(), True),
                     StructField('publish_date', StringType(), True),
                     StructField('harvested_date', StringType(), True),
                     StructField('following', IntegerType(), True),
                     StructField('followers', IntegerType(), True),
                     StructField('updates', IntegerType(), True),
                     StructField('post_type', StringType(), True),
                     StructField('account_type', StringType(), True),
                     StructField('retweet', IntegerType(), True),
                     StructField('account_category', StringType(), True),
                     StructField('new_june_2018', IntegerType(), True),
                     StructField('alt_external_id', StringType(), True),
                     StructField('tweet_id', StringType(), True),
                     StructField('article_url', StringType(), True),
                     StructField('tco1_step1', StringType(), True),
                     StructField('tco2_step1', StringType(), True),
                     StructField('tco3_step1', StringType(), True)                    
                    ])

# Create df by loading in all csv files in data_directory with schema
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("sep",",") \
    .schema(schema) \
    .load(data_directory)

In [3]:
# Filter for english content only
df_english = df.filter(df['language']=='English')
df_english.take(2) 

[Row(external_author_id='1647045721', author='CARRIETHORNTHON', content='New Study Reveals Liberals Have A Lower Average IQ Than Conservatives http://t.co/B82NFpFSf6 WE ON TWITTER KNEW THIS ALREADY.', region='United States', language='English', publish_date='6/1/2015 22:04', harvested_date='6/1/2015 22:04', following=80, followers=207, updates=1193, post_type='RETWEET', account_type='Right', retweet=1, account_category='RightTroll', new_june_2018=0, alt_external_id='1647045721', tweet_id='605495107186364419', article_url='http://twitter.com/CarrieThornthon/statuses/605495107186364419', tco1_step1='http://gopthedailydose.com/2015/06/01/new-study-reveals-liberals-have-a-lower-average-iq-than-conservatives/', tco2_step1=None, tco3_step1=None),
 Row(external_author_id='1647045721', author='CARRIETHORNTHON', content='Lindsey Graham has an entirely reasonable position on climate change, sometimes http://t.co/kZMZka7Ja7 http://t.co/55juNRjwAT', region='United States', language='English', publ

In [4]:
# Some information on the dataset:
print(df.count(), len(df.columns))
print(df_english.count(), len(df_english.columns))
df = df_english #replace df with english df

2946207 21
2096049 21


#### Data Cleaning
Now that the data has been ingested into a single dataframe, we can examine a few fields and determine whether any cleaning is needed.

Ideas:
- Extract URLs to a separate column and replace them with `<url>`
- Replace emojis with `<emoji>` and count them in a separate column
- Take publish date and extract hour, day, month, year
- Columns for length of content in words, characters
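
The ideas above can be sketched in plain Python before they are wrapped in Spark UDFs. This is an illustration only: `split_urls`, `date_parts`, `length_features`, and the simplified `URL_PATTERN` are hypothetical and are not the functions the notebook uses.

```python
import re
from datetime import datetime

# Simplified URL pattern for illustration; it treats any run of
# non-whitespace characters after http:// or https:// as one URL.
URL_PATTERN = re.compile(r"https?://\S+")


def split_urls(text):
    """Return (text with each URL replaced by '<url>', list of extracted URLs)."""
    urls = URL_PATTERN.findall(text)
    return URL_PATTERN.sub("<url>", text), urls


def date_parts(publish_date):
    """Parse a 'M/D/YYYY H:MM' string, the format seen in publish_date."""
    parsed = datetime.strptime(publish_date, "%m/%d/%Y %H:%M")
    return {"hour": parsed.hour, "day": parsed.day,
            "month": parsed.month, "year": parsed.year}


def length_features(text):
    """Word count (split on any whitespace) and character count."""
    return {"word_count": len(text.split()), "char_count": len(text)}
```

For example, `date_parts("6/1/2015 22:04")` yields hour 22, day 1, month 6, and year 2015.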

In [9]:
# Create features
import datetime
import re

import demoji      # used by convert_emojii
import emoji       # used by count_emoji and extract_emoji
import tldextract  # used by extract_domain_information

from pyspark.sql.functions import col
from pyspark.sql import functions as F
from pyspark.mllib.stat import Statistics
import pyspark.sql.functions as func
from pyspark.sql.types import StringType, ArrayType, IntegerType

# Twitter handles such as @name
rx_b = re.compile(r"@[a-zA-Z0-9]+")
# http, https and ftp URLs
rx_url = re.compile(r"(?:http|ftp|https):\/\/(?:[\w_-]+(?:(?:\.[\w_-]+)+))(?:[\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])")


# *************************************************************
def convert_emojii(string): 
    '''
    convert emoji to string representation with demoji
    '''
    try:
        return demoji.replace_with_desc(string, ":")
    except:
        return "COULD NOT CONVERT EMOJII"
convert_emojii_UDF = func.udf(lambda z:convert_emojii(z),StringType())   
# test = convert_emojii("🐝🐝🐝")   
# print(test)

# *************************************************************
def extract_domain_information(url):  
    '''
    Extract domain information with tldextract
        Attempts to get registered domain if not parses out domain information from url
    '''
    try:        
        if(url):            
            ext = tldextract.extract(url)
            if(ext.registered_domain):                
                return ext.registered_domain
            else :                
                return f'{ext.subdomain}.{ext.domain}.{ext.suffix}'                
        else:            
            return "NA"        
    except Exception as e:        
        return "NA"    
extract_domain_information_UDF = func.udf(lambda z:extract_domain_information(z),StringType())   

# *************************************************************
def extract_handles(content): 
    '''
        gets all the handles in the tweet of the form @[a-zA-Z0-9]+ and returns an array
    '''
    try:
        if(content is not None):        
            result = re.findall(rx_b, content) 
            return result
        else:
            return []
    except:
        return []    
extract_handles_UDF = func.udf(lambda z:extract_handles(z),ArrayType(StringType(), True))   
# test = extract_handles("Hi @MichelleObama , remember when you praised Harvey Weinstein as 'a wonderful human being, a good friend and a powerhouse.")
# print(test)

# *************************************************************
def count_emoji(string):
    '''
    Count number of emojis within a string
    '''
    if string:
        return emoji.emoji_count(string)
    else:
        return 0
count_emoji_udf = func.udf(lambda x: count_emoji(x), IntegerType())

# *************************************************************
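def extract_emoji(string):
    '''
    Extract the distinct emojis in a string and convert each one to its text name
    '''
    if string:
        # demojize expects a string, so convert each emoji separately and join the names
        # (newer emoji releases rename distinct_emoji_lis to distinct_emoji_list)
        return ' '.join(emoji.demojize(e) for e in emoji.distinct_emoji_lis(string))
    else:
        return 'None'
extract_emoji_udf = func.udf(lambda x: extract_emoji(x), StringType())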

# *************************************************************
def extract_urls(string):
    '''
    Extract all urls in string as a list (empty list if there are none)
    '''
    if string:
#         urls = re.findall('(?:(?:https?|ftp):\\/\\/)?[\\w/\\-?=%.]+\\.[\\w/\\-&?=%.]+', string)
        urls = re.findall(rx_url, string)

        return urls
    else:
        return []
# re.findall returns a list, so the UDF is declared as an array of strings
extract_urls_udf = func.udf(lambda x: extract_urls(x), ArrayType(StringType(), True))

# *************************************************************
def url_count(string):
    '''
    Count all urls in string
    '''
    if string:
        return(len(extract_urls(string)))
    else:
        return 0
url_count_udf = func.udf(lambda x: url_count(x), IntegerType())

# *************************************************************
def extract_url_parts(string):
    '''
    Return url in parts (https://stackoverflow.com/questions/27745/getting-parts-of-a-url-regex)
    '''
    if string:
        return re.findall(r'^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$', string)
    else:
        return 'None'

def extract_urls_redirect_base(string_1, string_2, string_3):
    '''
    Call extract_url_parts and create a list of hosts from Twitter's redirect columns.
    Returns an empty list if the first redirect is missing or a URL cannot be parsed.
    '''
    try:
        host_list = ['', '', '']
        if string_3:
            url_parts = extract_url_parts(string_3)
            host_list[2] = url_parts[0][2]
        if string_2:
            url_parts = extract_url_parts(string_2)
            host_list[1] = url_parts[0][2]
        if string_1:
            url_parts = extract_url_parts(string_1)
            host_list[0] = url_parts[0][2]
        else:
            return []
    except Exception:
        return []
    return host_list
# The function returns a list of hosts, so the UDF is declared as an array of strings
extract_urls_redirect_base_udf = func.udf(lambda x,y,z: extract_urls_redirect_base(x,y,z), ArrayType(StringType(), True))

# *************************************************************
def word_count(string):
    '''
    Count number of words in string (slightly error prone b/c split on spaces)
    '''
    if string:
        return len(string.split(' '))
    else:
        return 0
word_count_udf = func.udf(lambda x: word_count(x), IntegerType())

# *************************************************************
def character_count(string):
    '''
    Count number of characters in the tweet
    '''
    if string:
        return len(string)
    else: 
        return 0
character_count_udf = func.udf(lambda x: character_count(x), IntegerType())

# *************************************************************
def extract_date_info(string, info_type):
    '''
    IN WORK
    Extract date info
    '''
    date = datetime.datetime.strptime(string, '%m/%d/%Y %H:%M')
    
    if info_type == 'minute':
        info = date.minute
    elif info_type == 'hour':
        info = date.hour
    elif info_type == 'day':
        info = date.day
    elif info_type == 'month':
        info = date.month
    elif info_type == 'year':
        info = date.year    
    return info
extract_date_info_udf = func.udf(lambda x,y: extract_date_info(x,y), IntegerType())

# *************************************************************

def assignLabel(account_category):
    '''
        Assigns 1 - troll, or 0 - not-troll as a label to the tweet.
    '''
    if account_category in ("RightTroll", "LeftTroll" , "Fearmonger"):
        return 1
    else:
        return 0    
# test = assignLabel("Commercial")
# print(test)
assignLabel_udf = func.udf(assignLabel, IntegerType())
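
One detail of `extract_handles` worth checking is the handle pattern `@[a-zA-Z0-9]+`. Twitter usernames may also contain underscores, so a handle such as `@real_user` is cut off at the underscore. The sketch below compares the notebook's pattern with a variant that allows underscores; `HANDLE_WITH_UNDERSCORE` and `find_handles` are hypothetical names, not part of the original code.

```python
import re

# Pattern used in the notebook: letters and digits only.
HANDLE_NO_UNDERSCORE = re.compile(r"@[a-zA-Z0-9]+")

# Assumed alternative: also accept underscores, which Twitter usernames allow.
HANDLE_WITH_UNDERSCORE = re.compile(r"@[A-Za-z0-9_]+")


def find_handles(content, pattern):
    """Return every handle in `content` matched by `pattern` (empty list for None)."""
    if content is None:
        return []
    return pattern.findall(content)


sample = "Thanks @real_user and @MichelleObama!"
truncated = find_handles(sample, HANDLE_NO_UNDERSCORE)    # ['@real', '@MichelleObama']
complete = find_handles(sample, HANDLE_WITH_UNDERSCORE)   # ['@real_user', '@MichelleObama']
```

Whether the wider pattern is preferable depends on how the `handles` column is used later; it is shown here only to make the difference visible.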

In [10]:
# Create dataframe with all columns from feature extraction
df_enriched = df.withColumn("curated_content", convert_emojii_UDF(col("content"))) \
                .withColumn("tco1_step1_domain", extract_domain_information_UDF(col("tco1_step1"))) \
                .withColumn("tco2_step1_domain", extract_domain_information_UDF(col("tco2_step1"))) \
                .withColumn("tco3_step1_domain", extract_domain_information_UDF(col("tco3_step1"))) \
                .withColumn("handles", extract_handles_UDF(col("content"))) \
                .withColumn('emoji_count', count_emoji_udf(col('content'))) \
                .withColumn('emoji_text', extract_emoji_udf(col('content'))) \
                .withColumn('word_count', word_count_udf(col('content'))) \
                .withColumn('char_count', character_count_udf(col('content'))) \
                .withColumn('urls', extract_urls_udf(col('content'))) \
                .withColumn('url_count', url_count_udf(col('content'))) \
                .withColumn('url_hosts', extract_urls_redirect_base_udf(col('tco1_step1'), col('tco2_step1'), col('tco3_step1'))) \
                .withColumn('label',assignLabel_udf(df['account_category']))

In [11]:
df_enriched.select('publish_date','following','followers', 'emoji_count','word_count', 'char_count', 'url_count').summary().show()

+-------+--------------+-----------------+------------------+--------------------+------------------+------------------+------------------+
|summary|  publish_date|        following|         followers|         emoji_count|        word_count|        char_count|         url_count|
+-------+--------------+-----------------+------------------+--------------------+------------------+------------------+------------------+
|  count|       2096049|          2096049|           2096049|             2096049|           2096049|           2096049|           2096049|
|   mean|          null|4241.562128557109| 7146.343682328037|0.048428257163835385|13.181881721276554| 99.08216840350583| 0.863807096112734|
| stddev|          null| 6231.62797308032|11624.817215193514|  0.5200131955777119| 5.479144213764923|35.353085319885935|0.7578290022168858|
|    min|1/1/2013 16:16|                0|                 0|                   0|                 0|                 0|                 0|
|    25%|          n

In [12]:
df = df_enriched
#df_enriched.write.parquet('troll_tweet_full.parquet')

## SECTION 2: Data Splitting / Sampling

In [None]:
training, testing = df.randomSplit([0.6, 0.4], seed=314)
training_len = training.count()
training_troll_len = training.filter(training['label']==1).count()
testing_len = testing.count()
testing_troll_len = testing.filter(testing['label']==1).count()

print("Training count: {} || Training Troll Count and Ratio: {}, {}".format(training_len, training_troll_len, training_troll_len/training_len))
print("Testing count : {} || Testing Troll Count and Ratio: {}, {}".format(testing_len, testing_troll_len, testing_troll_len/testing_len))

In [None]:
training.write.parquet('tyler_training.parquet')
testing.write.parquet('tyler_testing.parquet')

In [None]:
training = spark.read.parquet('tyler_training.parquet')
testing = spark.read.parquet('tyler_testing.parquet')
training = training.dropna(subset = 'content')
testing = testing.dropna(subset = 'content')

## SECTION 3: Exploratory Data Analysis and Visualizations

In [None]:
# Unique counts of categorical columns:
from pyspark.sql.functions import desc
df.groupBy("region").count().sort(desc('count')).show()

In [None]:
df.groupBy("post_type").count().sort(desc('count')).show()

In [None]:
df.groupBy("account_category").count().sort(desc('count')).show()

#### Data Visualization
The histograms below show the distributions of several features in the dataset. Set `visuals = True` to render them.


In [13]:
import matplotlib.pyplot as plt
visuals = False

In [14]:
if visuals:

    # Visualize Followers per account per post
    bins, counts = df.select('followers').rdd.flatMap(lambda x: x).histogram([0,10, 100, 1000, 10000, 1000000, 10000000])
    fig = plt.figure()
    ax = fig.add_subplot(2, 1, 1)
    ax.hist(bins[:-1], bins=bins, weights=counts)
    ax.set_xscale('log')
    plt.xlabel('Number of Followers')
    plt.ylabel('Count')
    plt.title('Histogram of Followers per Account (log scale)')
    plt.show()

In [15]:
if visuals:
    # Visualize Word Count
    bins, counts = df.select('word_count').rdd.flatMap(lambda x:x).histogram(10)
    fig = plt.figure()
    ax = fig.add_subplot(2, 1, 1)
    ax.hist(bins[:-1], bins=bins, weights=counts)
    plt.xlabel('Number of Words in Tweet')
    plt.ylabel('Count')
    plt.title('Histogram of Words in Tweet')
    plt.show()

In [16]:
if visuals:
    # Visualize Character Count
    bins, counts = df.select('char_count').rdd.flatMap(lambda x:x).histogram(10)
    fig = plt.figure()
    ax = fig.add_subplot(2, 1, 1)
    ax.hist(bins[:-1], bins=bins, weights=counts)
    plt.xlabel('Number of Characters in Tweet')
    plt.ylabel('Count')
    plt.title('Histogram of Characters in Tweet')
    plt.show()

In [17]:
if visuals:
    # Visualize Number of URLs
    bins, counts = df.select('url_count').rdd.flatMap(lambda x:x).histogram(10)
    fig = plt.figure()
    ax = fig.add_subplot(2, 1, 1)
    ax.hist(bins[:-1], bins=bins, weights=counts)
    plt.xlabel('Number of URLs in Tweet')
    plt.ylabel('Count')
    plt.title('Histogram of URLs in Tweet')
    plt.show()

In [18]:
if visuals:
    # Visualize Number of Emojis
    bins, counts = df.select('emoji_count').rdd.flatMap(lambda x:x).histogram(10)
    fig = plt.figure()
    ax = fig.add_subplot(2, 1, 1)
    ax.hist(bins[:-1], bins=bins, weights=counts)
    plt.xlabel('Number of Emojis in Tweet')
    plt.ylabel('Count')
    plt.title('Histogram of Emojis in Tweet')
    plt.show()

In [1]:
#! pip install wordcloud


In [4]:
from pyspark.sql.functions import col
from pyspark.sql import functions as F
from pyspark.mllib.stat import Statistics
import pyspark.sql.functions as func
from pyspark.sql.types import StringType, ArrayType, IntegerType

def clean_tokens(string_array):
    import re
    cleaned_words = ['']
    for word in string_array:
        #if word[0] == '#':
        #    clean = re.sub('[^\P{P}#]+', "", word)
        #else:
        if 'http' in word:
            continue
        else:
            clean = re.sub("[^A-Za-z]+", "", word)
        
        cleaned_words.append(clean)

    
    cleaned_words = [word for word in cleaned_words if word]
    return cleaned_words
clean_tokens_udf = func.udf(lambda x: clean_tokens(x), ArrayType(StringType(), True))


In [None]:
from wordcloud import WordCloud
from pyspark.ml.feature import Tokenizer, StopWordsRemover

data_full = training.union(testing)

# Tokenize the tweet text, then strip URLs and non-letter characters
tok_content = Tokenizer(inputCol="content", outputCol="words")
token_words = tok_content.transform(data_full)
clean_token_words = token_words.withColumn("cleaned_words", clean_tokens_udf(col("words")))

# Remove English stop words from the cleaned tokens
remover_content = StopWordsRemover(inputCol="cleaned_words", outputCol="cleaned_words_filtered")
no_stop_words = remover_content.transform(clean_token_words)

In [None]:
clean_token_words.select('cleaned_words').show(5, truncate = False)

In [None]:
troll_words = no_stop_words.filter(no_stop_words['label'] == 1)
not_troll_words = no_stop_words.filter(no_stop_words['label'] == 0)

In [None]:
troll_only_words = troll_words.select('cleaned_words_filtered')
troll_only_words.show(5, truncate = False)
troll_words_final = troll_only_words.rdd.flatMap(lambda x: x).collect()

In [None]:
not_troll_only_words = not_troll_words.select('cleaned_words_filtered')
not_troll_only_words.show(5, truncate = False)
not_troll_words_final = not_troll_only_words.rdd.flatMap(lambda x: x).collect()

In [None]:
all_troll_words = ' '.join(sent for sent in [' '.join(word for word in tweet) for tweet in troll_words_final])
all_not_troll_words = ' '.join(sent for sent in [' '.join(word for word in tweet) for tweet in not_troll_words_final])

In [None]:
troll_cloud = WordCloud(max_words = 250, background_color = 'white').generate(all_troll_words)
not_troll_cloud = WordCloud(max_words = 250, background_color = 'white').generate(all_not_troll_words)

In [None]:
plt.figure()
plt.imshow(troll_cloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:
plt.figure()
plt.imshow(not_troll_cloud, interpolation="bilinear")
plt.axis("off")
plt.show()

## SECTION 4: Model Construction

#### Model Building

In [None]:
# import context manager: SparkSession
from pyspark.sql import SparkSession
from pyspark.sql.functions import col,lit
from pyspark.sql import functions as F
from pyspark.mllib.stat import Statistics
from pyspark.sql import DataFrame


from pyspark.ml import Pipeline, PipelineModel, Transformer
from pyspark.ml.feature import *  
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.sql.types import FloatType

from pyspark.ml.param.shared import HasInputCol, HasOutputCol, Param, Params, TypeConverters

from pyspark import keyword_only

import time

import matplotlib.pyplot as plt

In [None]:
#Create session with custom app name, grab context
spark = SparkSession \
        .builder \
        .master("local") \
        .appName("team1_sp22_final_project") \
        .config("spark.executor.memory", '150g') \
        .config('spark.executor.cores', '20') \
        .config('spark.cores.max', '20') \
        .config("spark.driver.memory",'150g') \
        .getOrCreate()

sc = spark.sparkContext

In [None]:
training = spark.read.parquet('tyler_training.parquet')
testing = spark.read.parquet('tyler_testing.parquet')
training = training.dropna(subset = 'content')
testing = testing.dropna(subset = 'content')

In [None]:
def binary_evaluation_metrics(prediction):
    evaluator = BinaryClassificationEvaluator(rawPredictionCol = 'prediction', labelCol = 'label', metricName = 'areaUnderROC')
    auROC = evaluator.evaluate(prediction)
    print("AUROC: {}".format(auROC))
    
    # MulticlassMetrics expects (prediction, label) pairs, in that order
    predictionsRdd = prediction.select("prediction","label").rdd
    predictionsRdd = predictionsRdd.map(lambda p: (float(p.prediction), float(p.label)))
    metrics = MulticlassMetrics(predictionsRdd)
    print(f'Accuracy with MulticlassMetrics is {metrics.accuracy}')
    # precision, recall and fMeasure are methods; report them for the troll class (label 1.0)
    print(f'Precision with MulticlassMetrics is {metrics.precision(1.0)}')
    print(f'Recall with MulticlassMetrics is {metrics.recall(1.0)}')
    print(f'F1 Score with MulticlassMetrics is {metrics.fMeasure(1.0)}')

    print(metrics.confusionMatrix().toArray())
    
def cross_val_train_test(cv_model, training, testing):
    t0 = time.time()
    cvModel = cv_model.setParallelism(20).fit(training) # evaluate up to 20 models in parallel
    print("train time:", time.time() - t0)
    print('-'*30)
    
    t0 = time.time()
    prediction = cvModel.transform(testing)
    print("test time:", time.time() - t0)
    print('-'*30)
    
    return cvModel, prediction
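
The numbers printed by `binary_evaluation_metrics` can be cross-checked by hand on a small sample. `MulticlassMetrics` takes `(prediction, label)` pairs in that order; swapping the two leaves accuracy unchanged but exchanges precision and recall. The pure-Python sketch below makes the definitions explicit. `binary_metrics` is a hypothetical helper, not a Spark API.

```python
def binary_metrics(prediction_and_labels, positive=1.0):
    """Compute accuracy, precision, recall and F1 for the `positive` class.

    `prediction_and_labels` is an iterable of (prediction, label) pairs,
    the same ordering that MulticlassMetrics expects.
    """
    tp = fp = fn = tn = 0
    for prediction, label in prediction_and_labels:
        if prediction == positive and label == positive:
            tp += 1          # predicted troll, is troll
        elif prediction == positive:
            fp += 1          # predicted troll, is not troll
        elif label == positive:
            fn += 1          # missed troll
        else:
            tn += 1          # correctly predicted not troll

    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Running it on a few hundred collected rows from a prediction dataframe and comparing against the Spark output is a cheap way to confirm the pair ordering is correct.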

#### Model Subset 1: Logistic Regression

##### Simple Column Evaluation

In [None]:
simple_cols = ['url_count', 'char_count', 'word_count', 'emoji_count', 'following', 'followers']

va = VectorAssembler(inputCols = simple_cols, outputCol = 'features')
lr = LogisticRegression(labelCol = 'label', featuresCol = 'features')

paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.maxIter, [5, 10]) \
    .build()

simple_cols_pipeline = Pipeline(stages = [va, lr])

crossval = CrossValidator(estimator = simple_cols_pipeline,
                          estimatorParamMaps = paramGrid,
                          evaluator = BinaryClassificationEvaluator(),
                          numFolds=5)
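
The grid above has two values for `regParam` and two for `maxIter`, so cross-validation compares four parameter combinations over five folds. The sketch below counts the resulting model fits in plain Python; `expand_grid` and `total_fits` are hypothetical helpers used only to illustrate the arithmetic.

```python
from itertools import product


def expand_grid(grid):
    """Expand {param: [values]} into a list of {param: value} combinations."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


def total_fits(grid, num_folds):
    """Number of models trained during cross-validation (combinations x folds)."""
    return len(expand_grid(grid)) * num_folds


lr_grid = {"regParam": [0.1, 0.01], "maxIter": [5, 10]}
combinations = expand_grid(lr_grid)           # 4 parameter combinations
fits = total_fits(lr_grid, num_folds=5)       # 20 fits across the folds
```

`CrossValidator` then refits the best combination once on the full training set, so the total training cost is one fit more than this count.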

In [None]:
lr_simple_cvModel, lr_simple_prediction = cross_val_train_test(crossval, training, testing)

In [None]:
binary_evaluation_metrics(lr_simple_prediction)

In [None]:
lr_simple_cvModel.save('lr_small_model')

##### Content Columns without URL, Text Only

In [None]:
class TokenCleaner(Transformer):
    """
    A custom Transformer
    """
    inputCol = Param(Params._dummy(), "inputCol", "input column name.", typeConverter=TypeConverters.toString)
    outputCol = Param(Params._dummy(), "outputCol", "output column name.", typeConverter=TypeConverters.toString)
    
    @keyword_only
    def __init__(self, inputCol: str = "input", outputCol: str = "output"):
        super(TokenCleaner, self).__init__()
        self._setDefault(inputCol=None, outputCol=None)
        kwargs = self._input_kwargs
        self.set_params(**kwargs)

    @keyword_only
    def set_params(self, inputCol: str = "input", outputCol: str = "output"):
        kwargs = self._input_kwargs
        self._set(**kwargs)

    def get_input_col(self):
        return self.getOrDefault(self.inputCol)

    def get_output_col(self):
        return self.getOrDefault(self.outputCol)

    def _transform(self, df: DataFrame):
        inputCol = self.get_input_col()
        outputCol = self.get_output_col()

        transform_udf = func.udf(lambda x: clean_tokens(x), ArrayType(StringType(), True))

        return df.withColumn(outputCol, transform_udf(inputCol))


In [None]:
tok_content = Tokenizer(inputCol="content", outputCol="words")

cleaned_token = TokenCleaner(inputCol = 'words', outputCol = 'cleaned_words')

remover_content = StopWordsRemover(inputCol="cleaned_words", outputCol="words_filtered")
htf_content = HashingTF(inputCol="words_filtered", outputCol="content_htf")  

va = VectorAssembler(inputCols=["content_htf"], outputCol="features")
lr = LogisticRegression(labelCol = 'label', featuresCol = 'features')

paramGrid = ParamGridBuilder() \
    .addGrid(htf_content.numFeatures, [10]) \
    .addGrid(lr.maxIter, [5]) \
    .addGrid(lr.regParam, [0.1]) \
    .build()

content_cols_pipeline = Pipeline(stages = [
                                          tok_content,
                                          cleaned_token,
                                          remover_content,
                                          htf_content,
                                          va,
                                          lr
                                         ])

crossval = CrossValidator(estimator = content_cols_pipeline,
                          estimatorParamMaps = paramGrid,
                          evaluator = BinaryClassificationEvaluator(),
                          numFolds=5)
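
The grid above fixes `numFeatures` at 10, which means every distinct word is hashed into one of only ten buckets and many unrelated words share a feature. The sketch below illustrates that effect with a toy version of the hashing trick. It is an assumption-laden illustration: Spark's `HashingTF` uses MurmurHash3, while this sketch uses MD5 from the standard library, so bucket indices will not match Spark's. `bucket_index`, `hashed_term_counts`, and `colliding_terms` are hypothetical names.

```python
import hashlib
from collections import defaultdict


def bucket_index(term, num_features):
    """Map a term to a bucket in [0, num_features) with a stable hash (MD5 here)."""
    digest = hashlib.md5(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_features


def hashed_term_counts(tokens, num_features):
    """Build a term-frequency vector of length `num_features` via the hashing trick."""
    vector = [0] * num_features
    for token in tokens:
        vector[bucket_index(token, num_features)] += 1
    return vector


def colliding_terms(vocabulary, num_features):
    """Return {bucket: [terms]} for every bucket shared by more than one term."""
    buckets = defaultdict(list)
    for term in vocabulary:
        buckets[bucket_index(term, num_features)].append(term)
    return {index: terms for index, terms in buckets.items() if len(terms) > 1}
```

With more distinct words than buckets, collisions are unavoidable, so a larger `numFeatures` (the `HashingTF` default is 262144) is usually worth including in the grid when memory and training time allow.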

In [None]:
lr_words_cvModel, lr_words_prediction = cross_val_train_test(crossval, training, testing)

In [None]:
binary_evaluation_metrics(lr_words_prediction)

##### Content Columns without URL, Add Simple Cols

In [None]:
va = VectorAssembler(inputCols=["content_htf", 'url_count', 'char_count', 'word_count', 'emoji_count', 'following', 'followers'], outputCol="features")
lr = LogisticRegression(labelCol = 'label', featuresCol = 'features')

#paramGrid = ParamGridBuilder() \
#    .addGrid(htf_content.numFeatures, [10, 50]) \
#    .addGrid(lr.maxIter, [10]) \
#    .addGrid(lr.regParam, [0.01]) \
#    .build()

#paramGrid = ParamGridBuilder() \
#    .addGrid(htf_content.numFeatures, [10]) \
#    .addGrid(lr.maxIter, [5]) \
#    .addGrid(lr.regParam, [0.1]) \
#    .build()


content_simple_cols_pipeline = Pipeline(stages = [
                                          tok_content,
                                          cleaned_token,
                                          remover_content,
                                          htf_content,
                                          va,
                                          lr
                                         ])

crossval = CrossValidator(estimator = content_simple_cols_pipeline,
                          estimatorParamMaps = paramGrid,
                          evaluator = BinaryClassificationEvaluator(),
                          numFolds=5)

In [None]:
lr_complex_cvModel, lr_complex_prediction = cross_val_train_test(crossval, training, testing)

In [None]:
binary_evaluation_metrics(lr_complex_prediction)

#### Model Subset 2: Random Forest

##### Simple Columns

In [None]:
va = VectorAssembler(inputCols = simple_cols, outputCol = 'features')
rf = RandomForestClassifier(featuresCol = 'features', labelCol = 'label')
paramGrid = ParamGridBuilder() \
    .addGrid(rf.maxDepth, [3, 5, 10]) \
    .addGrid(rf.numTrees, [5, 10, 20]) \
    .build()

rf_simple_cols_pipeline = Pipeline(stages = [va, rf])

crossval = CrossValidator(estimator = rf_simple_cols_pipeline,
                          estimatorParamMaps = paramGrid,
                          evaluator = BinaryClassificationEvaluator(),
                          numFolds=5)

In [None]:
rf_simple_cvModel, rf_simple_prediction = cross_val_train_test(crossval, training, testing)

In [None]:
binary_evaluation_metrics(rf_simple_prediction)

In [None]:
rf_simple_cvModel.save('rf_small_model_4_8')

##### Content and Simple

In [None]:
from pyspark.ml.classification import RandomForestClassifier

va = VectorAssembler(inputCols=["content_htf", 'url_count', 'char_count', 'word_count', 'emoji_count', 'following', 'followers'], outputCol="features")
rf = RandomForestClassifier(featuresCol = 'features', labelCol = 'label')

paramGrid = ParamGridBuilder() \
    .addGrid(htf_content.numFeatures, [10, 50]) \
    .addGrid(rf.maxDepth, [5, 10]) \
    .addGrid(rf.numTrees, [20]) \
    .build()

content_cols_pipeline = Pipeline(stages = [
                                          tok_content,
                                          cleaned_token,
                                          remover_content,
                                          htf_content,
                                          va,
                                          rf
                                         ])

crossval = CrossValidator(estimator = content_cols_pipeline,
                          estimatorParamMaps = paramGrid,
                          evaluator = BinaryClassificationEvaluator(),
                          numFolds=5)


In [None]:
rf_complex_cvModel, rf_complex_prediction = cross_val_train_test(crossval, training, testing)

In [None]:
binary_evaluation_metrics(rf_complex_prediction)

## SECTION 5: Model Evaluation