# IDS 561 Homework 2
Isaac Salvador<br>UIN: 669845132

## Dataset
The first task prior to data manipulation is importing the necessary `spark` packages and the _**Amazon_Responded_Oct05.csv**_ file as a `spark` `rdd`.

In [1]:
from pyspark import SparkContext, SparkConf
import pandas as pd

conf = SparkConf().setAppName("IDS561_HW2").set("spark.driver.bindAddress", "127.0.0.1")
sc = SparkContext(conf=conf)
sc.setLogLevel("OFF")

# create initial rdd with raw csv file data
rdd = sc.textFile("data/Amazon_Responded_Oct05.csv")

23/09/29 22:16:16 WARN Utils: Your hostname, Isaacs-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 10.0.0.163 instead (on interface en0)
23/09/29 22:16:16 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/09/29 22:16:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


We can then use the `take()` method inside a small `head()` helper function to print the first 5 records of the `rdd`:

In [7]:
def head(rdd, n: int=5) -> None:
    for i in rdd.take(n):
        print(i)


head(rdd)

id_str,tweet_created_at,user_screen_name,user_id_str,user_statuses_count,user_favourites_count,user_protected,user_listed_count,user_following,user_description,user_location,user_verified,user_followers_count,user_friends_count,user_created_at,tweet_language,text_,favorite_count,favorited,in_reply_to_screen_name,in_reply_to_status_id_str,in_reply_to_user_id_str,retweet_count,retweeted,text

'793270689780203520',Tue Nov 01 01:57:25 +0000 2016,SeanEPanjab,143515471,51287,4079,False,74,False,"Content marketer; Polyglot; Beard aficionado; Sikh. Persian, Catalan, French, Spanish. You'll find lol in the interstitial lulls of my tweets.",غریب الوطن,False,1503,850,Thu May 13 17:43:52 +0000 2010,en,@AmazonHelp Can you please DM me? A product I ordered last year never arrived.,0,False,AmazonHelp,,85741735,0,False,
'793281386912354304',Tue Nov 01 02:39:55 +0000 2016,AmazonHelp,85741735,2225450,11366,False,796,False,We answer Amazon support questions 7 days a week. Support available in English / D

Two issues with the `rdd` are now apparent. First, `textFile()` reads every line as a plain string, so the header row appears as an ordinary record in the data. Second, empty rows are scattered throughout the `rdd`. We can use the `.filter()` method to remove both kinds of errant rows.

In [8]:
# get first string in rdd
header = rdd.first()
column_names = header.split(",")

rdd1 = rdd\
    .filter(lambda x: x != header)\
    .filter(lambda x: x != '')

We can also conveniently show the indices and column names as a byproduct of this operation.

In [9]:
# print indices and columns of rdd1
print("index\tcolumn name")
print("-------------------")

for i, column in enumerate(column_names):
    print(f"{i}\t{column}")


index	column name
-------------------
0	id_str
1	tweet_created_at
2	user_screen_name
3	user_id_str
4	user_statuses_count
5	user_favourites_count
6	user_protected
7	user_listed_count
8	user_following
9	user_description
10	user_location
11	user_verified
12	user_followers_count
13	user_friends_count
14	user_created_at
15	tweet_language
16	text_
17	favorite_count
18	favorited
19	in_reply_to_screen_name
20	in_reply_to_status_id_str
21	in_reply_to_user_id_str
22	retweet_count
23	retweeted
24	text


We now assume that a valid record is one whose list produced by the `split()` method contains exactly 25 elements, matching the 25 columns in the summary above. In the same operation we extract the six columns needed for further analysis: `id_str`, `tweet_created_at`, `user_verified`, `favorite_count`, `retweet_count`, and `text_`.

In [5]:
relevant_indices = [0,1,11,17,22,16]

rdd2 = rdd1\
    .filter(lambda x: len(x.split(",")) == 25)\
    .map(lambda x: [x.split(",")[i] for i in relevant_indices])

print([column_names[i] for i in relevant_indices])
head(rdd2, 5)

['id_str', 'tweet_created_at', 'user_verified', 'favorite_count', 'retweet_count', 'text_']
["'793502854459879424'", 'Tue Nov 01 17:19:57 +0000 2016', 'True', '0', '0', '@SeanEPanjab Please give us a call/chat so we can look into this order for you: https://t.co/hApLpMlfHN. ^HB']
["'793513446633533440'", 'Tue Nov 01 18:02:03 +0000 2016', 'True', '0', '0', "@SeanEPanjab I'm not able to access account info here. Please reach out by Phone/Chat so we can look at this: https://t.co/EKXRLsnxJu ^GL"]
["'793299404975247360'", 'Tue Nov 01 03:51:31 +0000 2016', 'False', '0', '0', "@JeffBezos @amazonIN @AmazonHelp Tring...Tring...Tring Who's There? Your Suffering Customers from India Mr. Bezos...Get Up and Help!"]
["'793407430344310785'", 'Tue Nov 01 11:00:46 +0000 2016', 'False', '0', '0', '@AmazonHelp How many times do you want to write back to you guys??? Check the complaints through email regarding  Order 171-1338898-5999507.']
["'793423313674571776'", 'Tue Nov 01 12:03:53 +0000 2016', 'True'
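A plain `split(",")` also splits on commas inside quoted fields such as `user_description`, so a record like that yields more than 25 elements and is dropped by the filter above. The following is a minimal sketch, assuming a hypothetical three-column line and a hypothetical helper `parse_line`, of how Python's built-in `csv` module treats a quoted field differently:

```python
import csv


def parse_line(line):
    """Parse one CSV line, keeping commas inside double-quoted fields."""
    return next(csv.reader([line]))


# Hypothetical three-column line; the quoted middle field contains a comma.
sample = 'SeanEPanjab,"Content marketer; Polyglot, Sikh",False'

naive = sample.split(",")    # also splits inside the quoted field
parsed = parse_line(sample)  # treats the quoted field as one value

print(len(naive), naive)
print(len(parsed), parsed)
```

A helper like this could be passed to `map()` in place of `x.split(",")`. Fields that contain line breaks would still be split across records by `textFile()`, so this is only a partial remedy.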

## Task 1
### _Step 1_
To remove records where `"user_verified"` is `"False"`, we use the `filter()` method. Subsequently, the `map()` method is used to drop the `"user_verified"` column and convert the `"favorite_count"` and `"retweet_count"` columns to `int`.

In [76]:
rdd3 = rdd2\
    .filter(lambda x: x[2] == "True")\
    .map(lambda x: [x[0], x[1], int(x[3]), int(x[4]), x[5]] )

print(f"Number of verified records: {rdd3.count()}")

Number of verified records: 100965
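The conversion above assumes every count field is a valid integer string; a single non-numeric value would raise a `ValueError`. As a rough sketch only, the hypothetical helper `to_verified_record` below combines the verification check and the conversion in one function and returns `None` for records it cannot convert. The sample rows are made up for illustration.

```python
def to_verified_record(fields):
    """Convert one six-field record, or return None if it should be skipped.

    fields: [id_str, tweet_created_at, user_verified,
             favorite_count, retweet_count, text_]
    """
    if fields[2] != "True":
        return None  # not a verified user
    try:
        favorite_count = int(fields[3])
        retweet_count = int(fields[4])
    except ValueError:
        return None  # count field is not an integer string
    return [fields[0], fields[1], favorite_count, retweet_count, fields[5]]


# Made-up sample rows (not taken from the dataset).
rows = [
    ["'1'", "Tue Nov 01 17:19:57 +0000 2016", "True", "3", "1", "verified"],
    ["'2'", "Tue Nov 01 03:51:31 +0000 2016", "False", "0", "0", "unverified"],
    ["'3'", "Tue Nov 01 11:00:46 +0000 2016", "True", "n/a", "0", "malformed"],
]

converted = [to_verified_record(r) for r in rows]
kept = [r for r in converted if r is not None]
print(kept)
```

On an `rdd`, the same function could be applied with `map()` followed by `filter(lambda r: r is not None)`.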


### _Step 2_
In order to group by `"tweet_created_at"`, we first slice each timestamp string to extract the month and day in the form `"MMM DD"`.



In [46]:
rdd4 = rdd3.map(
    lambda x: [x[0], x[1][4:10], *x[2:]]
)

In [87]:
head(rdd4)

["'793502854459879424'", 'Nov 01', 0, 0, '@SeanEPanjab Please give us a call/chat so we can look into this order for you: https://t.co/hApLpMlfHN. ^HB']
["'793513446633533440'", 'Nov 01', 0, 0, "@SeanEPanjab I'm not able to access account info here. Please reach out by Phone/Chat so we can look at this: https://t.co/EKXRLsnxJu ^GL"]
["'793423313674571776'", 'Nov 01', 0, 0, '@aakashwangnoo Hi! We have responded to you here: https://t.co/v4YVCa3rff ^SG(1/2)']
["'793423314333134850'", 'Nov 01', 0, 0, '@aakashwangnoo Please don’t provide your order details as we consider them to be personal info. Our page is visible to public. ^SG(2/2)']
["'793551822476705793'", 'Nov 01', 0, 0, "@aakashwangnoo Hey! I see that we've already responded to your query. That is the best possible resolution we can provide 1/2 ^MM"]
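The slice `[4:10]` keeps the month and day but drops the year, so tweets from the same calendar day in different years would share one key. The sketch below compares that slice with a key that retains the year. The helper names and the constant `TWITTER_FORMAT` are hypothetical, and the format string is an assumption based on the timestamps shown above.

```python
from datetime import datetime

# Assumed layout of tweet_created_at, e.g. "Tue Nov 01 17:19:57 +0000 2016".
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def day_key(created_at):
    """Month and day only, as produced by the [4:10] slice."""
    return created_at[4:10]


def day_key_with_year(created_at):
    """ISO date string that keeps the year."""
    parsed = datetime.strptime(created_at, TWITTER_FORMAT)
    return parsed.strftime("%Y-%m-%d")


sample = "Tue Nov 01 17:19:57 +0000 2016"
print(day_key(sample))
print(day_key_with_year(sample))
```

The slice is cheaper and sufficient when every date in the data falls within a single year of each month-day pair; the parsed version is safer when that is not known.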


To obtain the number of tweets created per day, we apply the following `rdd` methods to `rdd4`:
1. `groupBy()` – Records are grouped by the modified `tweet_created_at` column.
2. `map()` – Each group is mapped to a pair of the date and the number of records for that date.
3. `collect()` – The resulting `(date, count)` pairs are collected into a regular Python list for analysis.

In [86]:
tweet_dates = rdd4\
    .groupBy(
        # use groupBy() to group by tweet_created_at
        lambda x: x[1]
    )\
    .map(
        # use map() to collect dates and number of corresponding records
        lambda x: (x[0], len(list(x[1])))
    ).collect()

# obtain the date with the most tweets
biggest_day, most_tweets = max(tweet_dates, key = lambda x: x[1])

print(f"{biggest_day} had {most_tweets} tweets, the highest number of tweets in the dataset.")

Jan 11 had 893 tweets, the highest number of tweets in the dataset.
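`groupBy()` followed by `len(list(...))` materializes every record of a group just to count it. The sketch below shows the same per-day count in plain Python with `collections.Counter`, using made-up records and the hypothetical helpers `count_by_day` and `busiest_day`:

```python
from collections import Counter


def count_by_day(records):
    """Count records per day; each record is [id_str, day, fav, rt, text]."""
    return Counter(record[1] for record in records)


def busiest_day(records):
    """Return (day, count) for the day with the most records."""
    counts = count_by_day(records)
    return max(counts.items(), key=lambda kv: kv[1])


# Made-up records in the same shape as rdd4.
records = [
    ["'1'", "Jan 10", 0, 0, "first"],
    ["'2'", "Jan 11", 1, 0, "second"],
    ["'3'", "Jan 11", 0, 2, "third"],
]

biggest_day, most_tweets = busiest_day(records)
print(biggest_day, most_tweets)
```

On an `rdd`, one analogous pattern is `map(lambda x: (x[1], 1))` followed by `reduceByKey(lambda a, b: a + b)`, which combines the counts without building each group in memory.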


### _Step 3.1: Sum of `"favorite_count"` and `"retweet_count"`_
To get the sum of `"favorite_count"` and `"retweet_count"` for each tweet on `biggest_day`, we perform the following operations on `rdd4`:
1. `filter()` – Keep only the tweets that were created on `biggest_day`.
2. `map()` – Map each remaining record to a key-value pair of `"id_str"` and the sum of `"favorite_count"` and `"retweet_count"`, producing `rdd5`.


In [98]:
rdd5 = rdd4\
    .filter(
        lambda x: x[1] == biggest_day
    )\
    .map(
        lambda x: [x[0], x[2]+x[3]]
    )
    
head(rdd5, 10)

["'819265252453941249'", 0]
["'819006776070770690'", 0]
["'819154757193498624'", 1]
["'819194260243312640'", 0]
["'819251153112367105'", 0]
["'819235424476495872'", 1]
["'819167359734874112'", 0]
["'819045790878408704'", 0]
["'819048545508622336'", 0]
["'819256450543472640'", 0]


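In [109]:
# sort tweets by combined favorite and retweet count, highest first
rdd6 = rdd5.sortBy(
    lambda x: x[1], ascending=False
)

In [117]:
# attach the tweet text to each (id_str, score) pair by joining on id_str
# note: a join shuffles records by key, so the sort order of rdd6 is not
# guaranteed to carry over to the 100 records returned by take()
top_100_tweets = rdd6\
    .leftOuterJoin(
        rdd4.map(
            lambda x: [x[0], x[-1]]
        )
    ).take(100)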
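A join repartitions records by key, so an ordering established before the join is not guaranteed afterward. The sketch below, in plain Python with made-up ids and a hypothetical helper `top_n_with_text`, selects the top `n` records after the text has been attached, which keeps the result independent of record order:

```python
import heapq


def top_n_with_text(scores, texts, n=100):
    """Attach text to (id_str, score) pairs, then keep the n highest scores.

    scores: iterable of (id_str, score)
    texts:  dict mapping id_str to tweet text
    Returns records shaped like a join result: (id_str, (score, text)).
    """
    joined = (
        (tweet_id, (score, texts.get(tweet_id)))
        for tweet_id, score in scores
    )
    return heapq.nlargest(n, joined, key=lambda kv: kv[1][0])


# Made-up data for illustration.
scores = [("'a'", 0), ("'b'", 5), ("'c'", 2)]
texts = {"'a'": "text a", "'b'": "text b", "'c'": "text c"}

top_two = top_n_with_text(scores, texts, n=2)
print(top_two)
```

On an `rdd`, `takeOrdered(100, key=lambda kv: -kv[1][0])` applied to the joined result plays a similar role.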

                                                                                

In [165]:
# count how often each whitespace-separated token appears in the selected tweets
word_freq_counts = {}

for tweet in top_100_tweets:
    # each joined record has the form (id_str, (score, text))
    text = tweet[1][1]

    for token in text.split():
        word_freq_counts[token] = word_freq_counts.get(token, 0) + 1

# list tokens from most to least frequent
sorted(word_freq_counts.items(), key = lambda x: x[1], reverse=True)

[('to', 66),
 ('the', 62),
 ('you', 44),
 ('here:', 28),
 ('this', 27),
 ('your', 27),
 ('for', 26),
 ('us', 22),
 ('sorry', 21),
 ('can', 19),
 ("I'm", 19),
 ('We', 14),
 ('You', 14),
 ('Please', 14),
 ('delivery', 13),
 ('our', 13),
 ('a', 13),
 ('and', 13),
 ('on', 12),
 ('of', 12),
 ('we', 12),
 ('order', 11),
 ('Have', 11),
 ('with', 11),
 ('in', 11),
 ('like', 11),
 ('have', 10),
 ('is', 10),
 ('date', 9),
 ('out', 9),
 ('What', 9),
 ('it', 8),
 ('by', 8),
 ('reach', 8),
 ('look', 8),
 ('or', 8),
 ('an', 8),
 ('were', 7),
 ('any', 7),
 ('report', 7),
 ('^KD', 7),
 ('^MT', 7),
 ('that', 7),
 ('these', 7),
 ('contact', 7),
 ('be', 7),
 ('from', 7),
 ('time', 7),
 ('this.', 6),
 ('^KH', 6),
 ('into', 6),
 ('Sorry', 6),
 ('if', 6),
 ('sold', 5),
 ('^HK', 5),
 ("We're", 5),
 ('Did', 5),
 ('Thanks', 5),
 ('you.', 5),
 ('was', 5),
 ('are', 5),
 ('using', 5),
 ('when', 5),
 ('hear', 5),
 ('would', 5),
 ('at', 5),
 ('options', 5),
 ('received', 5),
 ('about', 5),
 ('^AE', 5),
 ("We'd", 5)
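The counts above treat `'We'` and `'we'`, `'Sorry'` and `'sorry'`, and `'this'` and `'this.'` as different tokens. As a sketch only, the hypothetical helper `word_frequencies` below shows one way to lowercase tokens and strip surrounding punctuation before counting; the sample sentences are made up.

```python
import string
from collections import Counter


def word_frequencies(texts, normalize=False):
    """Count whitespace-separated tokens, optionally normalizing them."""
    counts = Counter()
    for text in texts:
        for token in text.split():
            if normalize:
                # lowercase and drop leading/trailing punctuation
                token = token.lower().strip(string.punctuation)
            if token:
                counts[token] += 1
    return counts


# Made-up sample sentences.
texts = ["We are sorry about this.", "Sorry, we can look into this"]

raw = word_frequencies(texts)
normalized = word_frequencies(texts, normalize=True)
print(raw["sorry"], normalized["sorry"])
```

Stripping punctuation also changes tokens such as `^KD` or `@AmazonHelp`, so whether to normalize depends on whether agent signatures and mentions should stay distinguishable.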

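In [152]:
find_text_rdd = sc.textFile("data/find_text.csv")

# ASSUMPTION: the first comma-separated field of each line in find_text.csv
# is an id_str written in the same form as the id_str values in rdd2
test = find_text_rdd\
    .map(lambda x: (x.split(",")[0], None))\
    .leftOuterJoin(
        # key each tweet by id_str so its text can be looked up
        rdd2.map(
            lambda x: (x[0], x[-1])
        )
    )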