### 1. Data Import and Preprocessing

#### 1.1 Data Import

In [None]:
# import modules
import pandas as pd
from pyspark.sql import SparkSession

In [None]:
# build spark session and spark context
spark = SparkSession.builder \
        .appName("fb") \
        .getOrCreate()
sc = spark.sparkContext

**Read in Data**

In [206]:
fb = spark.read.csv('Features_Variant_1.csv')
fb.first()

Row(_c0='634995', _c1='0', _c2='463', _c3='1', _c4='0.0', _c5='806.0', _c6='11.291044776119403', _c7='1.0', _c8='70.49513846124168', _c9='0.0', _c10='806.0', _c11='7.574626865671642', _c12='0.0', _c13='69.435826365571', _c14='0.0', _c15='76.0', _c16='2.6044776119402986', _c17='0.0', _c18='8.50550186882253', _c19='0.0', _c20='806.0', _c21='10.649253731343284', _c22='1.0', _c23='70.25478763764251', _c24='-69.0', _c25='806.0', _c26='4.970149253731344', _c27='0.0', _c28='69.85058043098057', _c29='0', _c30='0', _c31='0', _c32='0', _c33='0', _c34='65', _c35='166', _c36='2', _c37='0', _c38='24', _c39='0', _c40='0', _c41='0', _c42='1', _c43='0', _c44='0', _c45='0', _c46='0', _c47='0', _c48='0', _c49='0', _c50='0', _c51='0', _c52='1', _c53='0')

**Rename Columns**

In [209]:
# name dataframe with correct column names as "df"
df = (fb.withColumnRenamed('_c0', 'PageLikes')              # page likes
             .withColumnRenamed('_c1', 'PageCheckin')       # page's "Check-in" count
             .withColumnRenamed('_c2', 'PageTalking')       # page's "Talking About" count
             .withColumnRenamed('_c3', 'PageCategory')      # page category
             
            # C1: total comment count before selected base date/time
             .withColumnRenamed('_c4', 'C1min')             # min of C1
             .withColumnRenamed('_c5', 'C1max')             # max of C1
             .withColumnRenamed('_c6', 'C1avg')             # avg of C1
             .withColumnRenamed('_c7', 'C1med')             # median of C1
             .withColumnRenamed('_c8', 'C1std')             # standard deviation of C1
      
            # C2: comment count in last 24 hrs w.r.t base date/time
             .withColumnRenamed('_c9', 'C2min')             # min of C2
             .withColumnRenamed('_c10', 'C2max')            # max of C2
             .withColumnRenamed('_c11', 'C2avg')            # avg of C2
             .withColumnRenamed('_c12', 'C2med')            # median of C2
             .withColumnRenamed('_c13', 'C2std')            # standard deviation of C2
    
            # C3: comment count from last 48 to last 24 hrs w.r.t base date/time
             .withColumnRenamed('_c14', 'C3min')            # min of C3
             .withColumnRenamed('_c15', 'C3max')            # max of C3
             .withColumnRenamed('_c16', 'C3avg')            # avg of C3
             .withColumnRenamed('_c17', 'C3med')            # median of C3
             .withColumnRenamed('_c18', 'C3std')            # standard deviation of C3

            # C4: comment count in first 24 hrs after publishing document but before base date/time
             .withColumnRenamed('_c19', 'C4min')            # min of C4
             .withColumnRenamed('_c20', 'C4max')            # max of C4
             .withColumnRenamed('_c21', 'C4avg')            # avg of C4
             .withColumnRenamed('_c22', 'C4med')            # median of C4
             .withColumnRenamed('_c23', 'C4std')            # standard deviation of C4
      
            # C5: difference between C2 and C3
             .withColumnRenamed('_c24', 'C5min')            # min of C5
             .withColumnRenamed('_c25', 'C5max')            # max of C5
             .withColumnRenamed('_c26', 'C5avg')            # avg of C5
             .withColumnRenamed('_c27', 'C5med')            # median of C5
             .withColumnRenamed('_c28', 'C5std')            # standard deviation of C5
      
            # Other features
             .withColumnRenamed('_c29', 'CC1')              # total number of comments before selected base date/time
             .withColumnRenamed('_c30', 'CC2')              # number of comments in last 24 hours relative to base date/time
             .withColumnRenamed('_c31', 'CC3')              # number of comments in last 48 hours to last 24 hours relative to base date/time
             .withColumnRenamed('_c32', 'CC4')              # number of comments in the first 24 hours after the publication of post but before base date/time
             .withColumnRenamed('_c33', 'CC5')              # difference between CC2 and CC3
             .withColumnRenamed('_c34', 'BaseTime')         # selected base time used to simulate the scenario (encoded as a decimal in 0-71; exact meaning unclear)
             .withColumnRenamed('_c35', 'PostLength')       # character count in post
             .withColumnRenamed('_c36', 'PostShare')        # number of shares of the post (how many people shared it to their timeline)
             .withColumnRenamed('_c37', 'PagePromote')      # whether the original poster "promoted" this post
             .withColumnRenamed('_c38', 'HLocal')           # H: number of hours over which the target variable (Comments) is counted
             
            # Binary variables that represent the day(Sun-Sat) on which the post was published
             .withColumnRenamed('_c39', 'PostSun')         
             .withColumnRenamed('_c40', 'PostMon')          
             .withColumnRenamed('_c41', 'PostTue') 
             .withColumnRenamed('_c42', 'PostWed') 
             .withColumnRenamed('_c43', 'PostThu') 
             .withColumnRenamed('_c44', 'PostFri')   
             .withColumnRenamed('_c45', 'PostSat') 
      
            # Binary variables that represent the day(Sun-Sat) of selected base Date/Time
             .withColumnRenamed('_c46', 'BaseSun')          
             .withColumnRenamed('_c47', 'BaseMon') 
             .withColumnRenamed('_c48', 'BaseTue')   
             .withColumnRenamed('_c49', 'BaseWed') 
             .withColumnRenamed('_c50', 'BaseThu')   
             .withColumnRenamed('_c51', 'BaseFri')
             .withColumnRenamed('_c52', 'BaseSat') 
      
            # Target Variable: number of comments received in the next H hrs (H is given by HLocal)
             .withColumnRenamed('_c53', 'Comments'))       
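
The chain above follows a regular pattern (five statistics for each of C1 to C5, then seven day flags for post and base time), so the same 54 names can be generated and applied in one step. This is only a sketch of an alternative: `COLUMN_NAMES` and `rename_columns` are names made up here, and it assumes the file always has the 54 columns in the order shown above. `DataFrame.toDF(*cols)` renames columns by position.

```python
# Hypothetical helper names; the column names themselves match the cell above.
PAGE_COLS = ["PageLikes", "PageCheckin", "PageTalking", "PageCategory"]
STATS = ["min", "max", "avg", "med", "std"]
DERIVED_COLS = [f"C{i}{s}" for i in range(1, 6) for s in STATS]  # C1min ... C5std
COUNT_COLS = [f"CC{i}" for i in range(1, 6)]                     # CC1 ... CC5
POST_COLS = ["BaseTime", "PostLength", "PostShare", "PagePromote", "HLocal"]
DAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
POST_DAY_COLS = [f"Post{d}" for d in DAYS]
BASE_DAY_COLS = [f"Base{d}" for d in DAYS]

# Names in file order: _c0 ... _c53
COLUMN_NAMES = (PAGE_COLS + DERIVED_COLS + COUNT_COLS + POST_COLS
                + POST_DAY_COLS + BASE_DAY_COLS + ["Comments"])


def rename_columns(fb):
    """Return the raw Spark DataFrame `fb` with all columns renamed by position."""
    return fb.toDF(*COLUMN_NAMES)
```

With the raw DataFrame from the read step, `df = rename_columns(fb)` should give the same column names as the chained version.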


**Cast Each Variable to an Appropriate Data Type**

In [8]:
# new dataframe named "df_casted"
df_casted = df.select(
                df.PageLikes.cast('float'),
                df.PageCheckin.cast('float'),
                df.PageTalking.cast('float'),
                df.PageCategory.cast('int'),  # number corresponds to categories in pdf
                df.C1min.cast('float'),
                df.C1max.cast('float'),
                df.C1avg.cast('float'),
                df.C1med.cast('float'),
                df.C1std.cast('float'),
                df.C2min.cast('float'),
                df.C2max.cast('float'),
                df.C2avg.cast('float'),
                df.C2med.cast('float'),
                df.C2std.cast('float'),
                df.C3min.cast('float'),
                df.C3max.cast('float'),
                df.C3avg.cast('float'),
                df.C3med.cast('float'),
                df.C3std.cast('float'),
                df.C4min.cast('float'),
                df.C4max.cast('float'),
                df.C4avg.cast('float'),
                df.C4med.cast('float'),
                df.C4std.cast('float'),
                df.C5min.cast('float'),
                df.C5max.cast('float'),
                df.C5avg.cast('float'),
                df.C5med.cast('float'),
                df.C5std.cast('float'),
                df.CC1.cast('float'),
                df.CC2.cast('float'),
                df.CC3.cast('float'),
                df.CC4.cast('float'),
                df.CC5.cast('float'),
                df.BaseTime.cast('float'),
                df.PostLength.cast('float'),
                df.PostShare.cast('float'),
                df.PagePromote.cast('int'),  # integer encoded
                df.HLocal.cast('float'),
                df.PostSun.cast('int'),
                df.PostMon.cast('int'),
                df.PostTue.cast('int'),
                df.PostWed.cast('int'),
                df.PostThu.cast('int'),
                df.PostFri.cast('int'),
                df.PostSat.cast('int'),
                df.BaseSun.cast('int'),
                df.BaseMon.cast('int'),
                df.BaseTue.cast('int'),
                df.BaseWed.cast('int'),
                df.BaseThu.cast('int'),
                df.BaseFri.cast('int'),
                df.BaseSat.cast('int'),
                df.Comments.cast('float')
)

In [211]:
# check schema
df_casted.printSchema()

root
 |-- PageLikes: float (nullable = true)
 |-- PageCheckin: float (nullable = true)
 |-- PageTalking: float (nullable = true)
 |-- PageCategory: integer (nullable = true)
 |-- C1min: float (nullable = true)
 |-- C1max: float (nullable = true)
 |-- C1avg: float (nullable = true)
 |-- C1med: float (nullable = true)
 |-- C1std: float (nullable = true)
 |-- C2min: float (nullable = true)
 |-- C2max: float (nullable = true)
 |-- C2avg: float (nullable = true)
 |-- C2med: float (nullable = true)
 |-- C2std: float (nullable = true)
 |-- C3min: float (nullable = true)
 |-- C3max: float (nullable = true)
 |-- C3avg: float (nullable = true)
 |-- C3med: float (nullable = true)
 |-- C3std: float (nullable = true)
 |-- C4min: float (nullable = true)
 |-- C4max: float (nullable = true)
 |-- C4avg: float (nullable = true)
 |-- C4med: float (nullable = true)
 |-- C4std: float (nullable = true)
 |-- C5min: float (nullable = true)
 |-- C5max: float (nullable = true)
 |-- C5avg: float (nullable = true)
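
Listing every column by hand in the cast cell makes it easy to leave one out. A possible alternative is to decide the type from the column name and loop over `df.columns`. This is a sketch only: `INT_COLS`, `target_type`, and `cast_all` are hypothetical names, and the int/float assignment simply mirrors the cast cell above.

```python
# Columns cast to int in the notebook; every other column is cast to float.
DAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
INT_COLS = (["PageCategory", "PagePromote"]
            + [f"Post{d}" for d in DAYS]
            + [f"Base{d}" for d in DAYS])


def target_type(name):
    """Return the Spark type name to cast column `name` to."""
    return "int" if name in INT_COLS else "float"


def cast_all(df):
    """Cast every column of the renamed Spark DataFrame `df`, keeping column order."""
    return df.select([df[c].cast(target_type(c)) for c in df.columns])
```

`df_casted = cast_all(df)` would then replace the long `select`, and `df_casted.printSchema()` can be used to confirm that all 54 columns are present.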

**Dummy Variable Construction**

Dummy-variable construction transforms a categorical variable into separate, binary-encoded variables. For example, if the variable is DayOfWeek with levels Sun-Sat, we would create 7 new variables named Sun through Sat, each taking the value 1 if the row falls on that day and 0 otherwise. This lets us feed those variables into the regression model, which cannot process categorical variables directly.

Luckily for us, the days of the week are already in dummy-variable format. The only other categorical variable I see is *Page Category*, which is numerically encoded, with each number representing a different category. (You can view the categories in the PDF called PageCategories in the GitHub repository.) I don't think dummy variables are practical here because the category has too many levels, so maybe we can use graphs to check whether category makes a large difference and omit it from the regression if it does not.
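
To make the encoding concrete, here is a tiny pure-Python sketch of dummy-variable construction for a day-of-week value. `one_hot` is a hypothetical helper written for illustration, not part of the notebook; in Spark itself this would more likely be done with `StringIndexer` and `OneHotEncoder` from `pyspark.ml.feature`.

```python
def one_hot(values, levels):
    """Return one dict per value, holding a 0/1 entry for every level."""
    return [{level: int(v == level) for level in levels} for v in values]


days = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
rows = one_hot(["Wed", "Sun"], days)
print(rows[0])
# {'Sun': 0, 'Mon': 0, 'Tue': 0, 'Wed': 1, 'Thu': 0, 'Fri': 0, 'Sat': 0}
```

Each row gets exactly one 1, which is the same layout the `PostSun`...`PostSat` and `BaseSun`...`BaseSat` columns already have.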

**Feature Scaling**

**Handling missing values and outliers**

**Handling semi structured/unstructured data**

**Dimension reduction (like PCA)**

### 2. Data Splitting and Sampling
- Split into train/test sets. Try 60/40 and 70/30 splits. We may have to do k-fold cross-validation.
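
As a rough illustration of both ideas, the sketch below splits row indices into train/test sets and into k folds using only the standard library. The function names and the default seed are assumptions made for this example. On the Spark DataFrame itself, `df_casted.randomSplit([0.6, 0.4], seed=42)` is the usual route, though its split sizes are only approximate.

```python
import random


def train_test_indices(n_rows, train_frac, seed=42):
    """Shuffle row indices and cut them into (train, test) lists."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    cut = round(n_rows * train_frac)
    return idx[:cut], idx[cut:]


def k_fold_indices(n_rows, k, seed=42):
    """Yield (train, test) index lists, one pair per fold."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal slices
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


train_60, test_40 = train_test_indices(100, 0.6)
train_70, test_30 = train_test_indices(100, 0.7)
```

Fixing the seed keeps the splits reproducible, which matters when comparing models trained on the 60/40 and 70/30 variants.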


### 3. Exploratory Data Analysis (at least 2 graphs)
- Consider histograms, bar plots, box plots, etc.
- Add colors/legends to separate groups.

### 4. Model Construction (at least 3 models)
Remember to construct each model using pipelines.
- Create a simple regression model with a small number of features as a benchmark. We will use it as a basis of comparison.
- Create 1 or 2 more sophisticated models, chosen from those covered in class.
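
A benchmark pipeline along these lines might look like the sketch below. The feature list is a placeholder picked only for illustration (we have not chosen features yet), and `build_baseline_pipeline` is a hypothetical name; `VectorAssembler`, `LinearRegression`, and `Pipeline` are the standard `pyspark.ml` classes.

```python
# Placeholder feature choice for the benchmark model; to be revisited after EDA.
BASELINE_FEATURES = ["PageLikes", "CC1", "CC2", "PostShare", "HLocal"]


def build_baseline_pipeline(feature_cols=BASELINE_FEATURES, label_col="Comments"):
    """Assemble the features into one vector column and fit a linear regression on it."""
    # Imported inside the function so the sketch loads even without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    assembler = VectorAssembler(inputCols=list(feature_cols), outputCol="features")
    lr = LinearRegression(featuresCol="features", labelCol=label_col)
    return Pipeline(stages=[assembler, lr])
```

With an active Spark session and a train/test split, usage would be roughly `model = build_baseline_pipeline().fit(train)` followed by `predictions = model.transform(test)`. The more sophisticated models can reuse the same assembler stage and swap out the final estimator.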

### 5. Model Evaluation

**Metrics for Regression:**
- R-squared (for single factor)
- Adjusted R-squared (for multifactor)
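
For reference, both metrics can be computed by hand as in the sketch below (the function names are ours). R-squared is 1 minus the ratio of residual to total sum of squares, and the adjusted version penalizes the number of features p given n rows: 1 - (1 - R²)(n - 1)/(n - p - 1). In Spark, `RegressionEvaluator(metricName="r2")` reports the unadjusted value.

```python
def r_squared(y_true, y_pred):
    """R-squared: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_rows, n_features):
    """Adjusted R-squared for a model with n_features predictors fit on n_rows rows."""
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_features - 1)
```

Because the adjustment grows with the number of features, it is the fairer number to quote when comparing the small benchmark model against models with many more inputs.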

**Metrics for Classification:**
- accuracy
- precision, recall, F1 score
- confusion matrix
- area under ROC curve
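
These metrics only apply if we turn the comment count into classes (for example, by thresholding it), which is an assumption and not something decided yet. Under that assumption, the sketch below computes the first few metrics from confusion-matrix counts; the function names are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1), using 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

Here `tp`, `tn`, `fp`, and `fn` are the four cells of a binary confusion matrix (true/false positives and negatives).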

### 6. Outside Resources
- Paper explaining some of the variables on p. 16.3: [link](https://ijssst.info/Vol-16/No-5/paper16.pdf)
- Assignment I found online based on this dataset; it might give us some good questions to ask: [link](http://cse.ucdenver.edu/~biswasa/ml-s18/files/assignments/assignment-01/Programming-Assignment-1.pdf)