# BDP Final Assignment: Twitter Education, Part 2 (Data Analysis)

`Recall`: In the data processing session, we filtered the ~100 million tweets down to an education-related dataset. In this notebook, we analyze that filtered dataset.


In [1]:
#Ensure we are using the right kernel
spark.version

'3.1.3'

In [2]:
import pandas as pd
import numpy as np
pd.set_option('display.max_colwidth', None)
pd.reset_option('display.max_rows')

#from itertools import compress 
import seaborn as sns 
import matplotlib.pyplot as plt

#import warnings
#warnings.filterwarnings(action='ignore')
#warnings.simplefilter('ignore')

In [3]:
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql import SparkSession

# GCP Tools and Functions

In [4]:
from google.cloud import storage

In [14]:
# Read data from the shared bucket
# Filtered tweets produced by the data processing notebook
dataPath = 'gs://msca-bdp-students-bucket/shared_data/hjiang248/final_sdf_v9'
# Version with organization and country tags, used in the organization section
dataPath2 = 'gs://msca-bdp-students-bucket/shared_data/hjiang248/final_sdf_v9_formatted_2'

In [6]:
spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.repl.eagerEval.enabled",True)

# Read data

In [15]:
%%time

educationDF = spark.read.parquet(dataPath)

CPU times: user 4.43 ms, sys: 66 µs, total: 4.49 ms
Wall time: 1.32 s


In [16]:
educationDF.printSchema()

root
 |-- created_at: string (nullable = true)
 |-- id: long (nullable = true)
 |-- lang: string (nullable = true)
 |-- text: string (nullable = true)
 |-- retweet_count: long (nullable = true)
 |-- favorite_count: long (nullable = true)
 |-- quote_count: long (nullable = true)
 |-- retweeted: string (nullable = true)
 |-- rtstatus_favorite_count: long (nullable = true)
 |-- rtstatus_retweet_count: long (nullable = true)
 |-- rtstatus_quote_count: long (nullable = true)
 |-- rt_hashtags: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- source_rt_usr_id: long (nullable = true)
 |-- source_rt_id: long (nullable = true)
 |-- location: string (nullable = true)
 |-- country: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- verified_user: boolean (nullable = true)
 |-- user_id: long (nullable = true)
 |-- user_name: string (nullable = true)
 |-- followers_count: long (nullable = true)
 |-- user_description: string (nullable = true)



In [17]:
educationDF.select(['favorite_count', 'retweet_count', 'retweeted', 'quote_count',
                     'rtstatus_favorite_count', 'rtstatus_retweet_count', 'rtstatus_quote_count']).describe()

                                                                                

summary,favorite_count,retweet_count,retweeted,quote_count,rtstatus_favorite_count,rtstatus_retweet_count,rtstatus_quote_count
count,7448390.0,7448390.0,7448390,7448390.0,5422324.0,5422324.0,5422324.0
mean,0.0,0.0,,0.0,11600.210113781472,2177.632209362628,331.7464070756377
stddev,0.0,0.0,,0.0,37840.520876375056,7046.1554961729,2066.069874766467
min,0.0,0.0,,0.0,0.0,1.0,0.0
max,0.0,0.0,RT,0.0,1213547.0,239754.0,69196.0



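The summary shows that the top-level favorite_count, retweet_count, and quote_count columns are zero for every tweet, and retweeted is a string column rather than a count, so these fields carry no usable engagement information in the raw data. The corresponding metrics are populated under the structure called retweet_status (the rtstatus_* columns here), which has values for 5,422,324 of the 7,448,390 tweets. We therefore use the rtstatus_* columns for engagement metrics.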
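The reasoning above can be checked directly against the summary table. The following is a minimal sketch that uses only figures copied from the `describe()` output; the `summary` dictionary and the variable names are illustrative, not part of the original notebook.

```python
# Figures copied from the describe() output above: (count, min, max) per column.
summary = {
    "favorite_count": (7448390, 0.0, 0.0),
    "retweet_count": (7448390, 0.0, 0.0),
    "quote_count": (7448390, 0.0, 0.0),
    "rtstatus_favorite_count": (5422324, 0.0, 1213547.0),
    "rtstatus_retweet_count": (5422324, 1.0, 239754.0),
    "rtstatus_quote_count": (5422324, 0.0, 69196.0),
}
total_rows = 7448390

# A column whose minimum equals its maximum is constant and carries no signal.
constant_cols = [col for col, (_, lo, hi) in summary.items() if lo == hi]

# Share of rows with a non-null value in each column.
coverage = {col: n / total_rows for col, (n, _, _) in summary.items()}

print("Constant columns:", constant_cols)
print(f"rtstatus coverage: {coverage['rtstatus_retweet_count']:.1%}")
```

The constant columns are exactly the top-level counts, while the rtstatus_* columns vary and cover the subset of rows that have a non-null retweeted-status value.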

# EDA

## Number of tweets

In [18]:
educationDF.count()

                                                                                

7448390

## Verified users

In [19]:
educationDF_verified = educationDF.select(['user_id', 'verified_user']).dropDuplicates()

In [20]:
educationDF_verified.groupBy('verified_user').agg(count('*'))

                                                                                

verified_user,count(1)
True,42013
False,3111273
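
To put these counts in proportion, the sketch below computes the verified share from the two figures above. Note that the counts are distinct (user_id, verified_user) pairs, so a user whose status changed during the collection window would appear in both groups; the variable names are illustrative.

```python
# Counts copied from the groupBy output above.
verified = 42013
unverified = 3111273

# Distinct (user_id, verified_user) pairs in the dataset.
distinct_users = verified + unverified
verified_share = verified / distinct_users

print(distinct_users, f"{verified_share:.2%}")
```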


## User by Organization

Note that while tagging organizations and countries in the processing step, we excluded tweets whose values were all null, as well as duplicates, so the total number of tweets is lower than the count reported above.
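
The filtering itself happened in the processing notebook and is not shown here. As a rough illustration only, the sketch below applies the same two rules to a toy pandas DataFrame. The rows are made up, and the choice of `content_cols` (which columns must all be null for a row to be dropped) and of `id` as the duplicate key are assumptions, not the author's actual criteria.

```python
import pandas as pd

# Toy rows using column names from the schema above; the values are made up.
tweets = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "text": ["a", "b", "b", None],
    "location": ["Chicago", None, None, None],
    "country": ["US", "IN", "IN", None],
})

# Assumption: a row counts as "all null" when every content column is null.
content_cols = ["text", "location", "country"]

cleaned = (
    tweets
    .dropna(how="all", subset=content_cols)  # drop rows with no content at all
    .drop_duplicates(subset=["id"])          # keep one row per tweet id
    .reset_index(drop=True)
)

print(len(tweets), "->", len(cleaned))
```

PySpark DataFrames provide `dropna(how='all', subset=...)` and `dropDuplicates([...])` with matching semantics, so the same two steps should carry over to a Spark pipeline.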

In [None]:
educationDF = spark.read.parquet(dataPath2)

In [25]:
org = educationDF.select(['organization', 'user_id'])

In [27]:
org_count = org.groupBy('organization').count()
org_count = org_count.orderBy('count', ascending=False)

In [28]:
org_count

                                                                                

organization,count
Other,6150765
News_Media,53359
Celebrity_Influencer,11095
Universities,8598
Government_Entities,7918
NGOs,2339
Schools,1206
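
Because `groupBy('organization').count()` counts rows rather than distinct `user_id` values, the figures above are tweet counts per organization type. A short sketch, using only the numbers from the table, of the total and each category's share (variable names are illustrative):

```python
# Row counts copied from the org_count output above.
org_counts = {
    "Other": 6150765,
    "News_Media": 53359,
    "Celebrity_Influencer": 11095,
    "Universities": 8598,
    "Government_Entities": 7918,
    "NGOs": 2339,
    "Schools": 1206,
}

total = sum(org_counts.values())
shares = {org: n / total for org, n in org_counts.items()}

print("Total tagged tweets:", total)
for org, share in shares.items():
    print(f"{org}: {share:.2%}")
```

The "Other" category dominates, so comparisons between the named organization types are probably easier to read if "Other" is reported separately.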
