# Now You Code In Class: 

## Tricks of The Pandas Masters Volume II

Once again, we will try something a bit different for our Activity - A series of Pandas coding challenges!

Datasets we will use:

- Reddit Data: https://raw.githubusercontent.com/mafudge/datasets/master/json-samples/reddit.json
- Episodes of the HBO series "The Wire": https://raw.githubusercontent.com/mafudge/datasets/master/tv-shows/the-wire.json


In [3]:
import pandas as pd
import warnings

pd.set_option('display.max_colwidth', None)
warnings.filterwarnings('ignore')

## Deserializing JSON

The following function takes a URL as input and returns the deserialized JSON as output.


In [4]:
def get_json(url: str):
    import requests
    response = requests.get(url)
    response.raise_for_status()
    return response.json()
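
To see what deserialization produces without touching the network, here is a minimal sketch that uses the standard-library `json` module on a made-up string. The payload is hypothetical and only mimics the nesting of the Reddit feed; it is not real data.

```python
import json

# Hypothetical JSON text that mimics the nesting of the Reddit feed
raw = '{"kind": "Listing", "data": {"children": [{"kind": "t3", "data": {"title": "Example"}}]}}'

# Deserializing turns JSON objects into dicts and JSON arrays into lists
doc = json.loads(raw)
print(type(doc).__name__)                       # dict
print(type(doc["data"]["children"]).__name__)   # list

# Walk the nested keys one level at a time to reach a value
first_title = doc["data"]["children"][0]["data"]["title"]
print(first_title)                              # Example
```

Printing `doc.keys()` at each level is a handy way to find where the interesting list lives.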


This example deserializes the Reddit news feed, finds the key that holds the articles, and displays the first article.

In [7]:
reddit = get_json("https://raw.githubusercontent.com/mafudge/datasets/master/json-samples/reddit.json")
articles = reddit["data"]["children"]
articles[0]

{'kind': 't3',
 'data': {'domain': 'wdrb.com',
  'banned_by': None,
  'media_embed': {},
  'subreddit': 'news',
  'selftext_html': None,
  'selftext': '',
  'likes': None,
  'suggested_sort': None,
  'user_reports': [],
  'secure_media': None,
  'link_flair_text': None,
  'id': '4bn12d',
  'from_kind': None,
  'gilded': 0,
  'archived': False,
  'clicked': False,
  'report_reasons': None,
  'author': 'homeboy422',
  'media': None,
  'score': 1,
  'approved_by': None,
  'over_18': False,
  'hidden': False,
  'num_comments': 1,
  'thumbnail': '',
  'subreddit_id': 't5_2qh3l',
  'hide_score': False,
  'edited': False,
  'link_flair_css_class': None,
  'author_flair_css_class': None,
  'downs': 0,
  'secure_media_embed': {},
  'saved': False,
  'removal_reason': None,
  'stickied': False,
  'from': None,
  'is_self': False,
  'from_id': None,
  'permalink': '/r/news/comments/4bn12d/man_arrested_for_stealing_suv_told_police_he/',
  'locked': False,
  'name': 't3_4bn12d',
  'created': 145877

In [12]:
# PROMPT 1 read in episodes of "The Wire" find the episodes key, and show the first episode.
thewire = get_json("https://raw.githubusercontent.com/mafudge/datasets/master/tv-shows/the-wire.json")
episodes = thewire["_embedded"]['episodes']

In [13]:
episodes[5]

{'id': 12912,
 'url': 'https://www.tvmaze.com/episodes/12912/the-wire-1x06-the-wire',
 'name': 'The Wire',
 'season': 1,
 'number': 6,
 'type': 'regular',
 'airdate': '2002-07-07',
 'airtime': '21:00',
 'airstamp': '2002-07-08T01:00:00+00:00',
 'runtime': 60,
 'rating': {'average': 8.5},
 'image': {'medium': 'https://static.tvmaze.com/uploads/images/medium_landscape/94/236933.jpg',
  'original': 'https://static.tvmaze.com/uploads/images/original_untouched/94/236933.jpg'},
 'summary': '<p><i>"...and all the pieces matter." - Freamon</i><br />Brandon\'s bloodied body is discovered in the pit. Wallace gets even more unsettled about the situation after Avon rewards him for his part in Brandon\'s murder. The detail gets a wiretap running. Daniels clashes with homicide MajorWilliam Rawlsover their approach to the evidence they have gathered thus far.</p>',
 '_links': {'self': {'href': 'https://api.tvmaze.com/episodes/12912'}}}

## Creating a dataframe

In this example we use `json_normalize()` to display the articles in a dataframe.

In [16]:
art_df = pd.json_normalize(articles)  # articles is a list of dicts
art_df.head(1)

Unnamed: 0,kind,data.domain,data.banned_by,data.subreddit,data.selftext_html,data.selftext,data.likes,data.suggested_sort,data.user_reports,data.secure_media,...,data.url,data.author_flair_text,data.quarantine,data.title,data.created_utc,data.distinguished,data.mod_reports,data.visited,data.num_reports,data.ups
0,t3,wdrb.com,,news,,,,,[],,...,http://www.wdrb.com/story/31546565/man-arrested-for-stealing-suv-told-police-he-needed-it-for-job-interview,,False,Man arrested for stealing SUV told police he needed it for job interview,1458748000.0,,[],False,,1
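
The dotted column names such as `data.title` come from flattening nested dicts. A small sketch with made-up records (not the Reddit data) shows the effect:

```python
import pandas as pd

# Made-up records, each with one nested dict (not real data)
records = [
    {"kind": "t3", "data": {"title": "First", "ups": 3}},
    {"kind": "t3", "data": {"title": "Second", "ups": 7}},
]

# Nested keys are joined to their parent key with a dot
flat = pd.json_normalize(records)
print(list(flat.columns))  # ['kind', 'data.title', 'data.ups']
print(flat["data.ups"].sum())  # 10
```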


In [17]:
# PROMPT 2 display the episodes as a dataframe
epdf = pd.json_normalize(episodes)
epdf.head(1)

Unnamed: 0,id,url,name,season,number,type,airdate,airtime,airstamp,runtime,summary,rating.average,image.medium,image.original,_links.self.href
0,12907,https://www.tvmaze.com/episodes/12907/the-wire-1x01-the-target,The Target,1,1,regular,2002-06-02,21:00,2002-06-03T01:00:00+00:00,60,"<p>Homicide detective Jimmy McNulty observes the murder trial of a mid-level drug dealer, D'Angelo Barksdale, and sees the prosecution's star witness recant her testimony. McNulty recognises drug king-pin Stringer Bell in the court room and believes he has manipulated the proceedings, so he circumvents the chain-of-command by talking to the judge, Daniel Phelan, who then places pressure on the police department to investigate the Barksdale drug-dealing organization, which, McNulty claims, has gotten away with ten murders in the last year. D'Angelo is welcomed home by his uncle, Barksdale patriarch, Avon, who is frustrated with him for placing himself in a situation where the police could charge him. Nevertheless, Avon allows him to return to work, but in what D'Angelo sees as a demotion, he is moved to a low-rise housing project known as ""the pit."" Meanwhile, homeless drug addict Bubbles, acts as mentor to another addict, Johnny Weeks, in an ill-conceived scam with severe consequences.</p>",7.8,https://static.tvmaze.com/uploads/images/medium_landscape/94/236937.jpg,https://static.tvmaze.com/uploads/images/original_untouched/94/236937.jpg,https://api.tvmaze.com/episodes/12907


## A Text Sentiment Web Service

The following function uses the http://text-processing.com service to calculate the sentiment for any input text. The result is a dict with probabilities and overall sentiment. For example:

INPUT: `"very nice"`

OUTPUT: 
```
{
    'probability': {
        'neg': 0.28997418956645504,
        'neutral': 0.13591527211268692,
        'pos': 0.710025810433545
    },
    'label': 'pos'
}
```


In [25]:
def get_text_sentiment(text: str) -> dict:
    import requests
    response = requests.post(
        "http://text-processing.com/api/sentiment/",
        data={"text": text},
        headers={"User-Agent": "Mozilla/5.0"}
    )
    response.raise_for_status()
    return response.json()

# Offline stand-in: this second definition replaces the one above and
# returns a random label string instead of calling the web service.
def get_text_sentiment(text: str) -> str:
    import random
    return random.choice(["Positive", "Negative", "Neutral"])


sentiment = get_text_sentiment("very nice")
sentiment

'Positive'
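
The output above is a plain string because the second definition is the one in effect. When the web service version is used, the result is a dict shaped like the documented example. The sketch below reads values out of that shape using a sample copied from the documented output, so no request is made.

```python
# Sample result copied from the documented OUTPUT example; no web request is made
result = {
    "probability": {
        "neg": 0.28997418956645504,
        "neutral": 0.13591527211268692,
        "pos": 0.710025810433545,
    },
    "label": "pos",
}

# The overall sentiment is under the "label" key
label = result["label"]

# The same label can be used to look up its probability
confidence = result["probability"][label]
print(label)                 # pos
print(round(confidence, 2))  # 0.71
```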

In [23]:
# PROMPT 3: get the sentiment of the text below and print the label only


## Using a lambda with `apply()`

The following example creates a new column (a Series) in the dataframe called `"title_sentiment"`, which holds the sentiment of each Reddit news title.

In [29]:
art_df["title_sentiment"] = art_df.apply(lambda row: get_text_sentiment(row["data.title"]), axis=1)
art_df[["data.title", "title_sentiment"]].head()

Unnamed: 0,data.title,title_sentiment
0,Man arrested for stealing SUV told police he needed it for job interview,Neutral
1,Zika outbreak needs $4 million: WHO,Neutral
2,"Austin police to fire officer who shot, killed nude teen",Positive
3,Hubble Unveils Monster Stars,Neutral
4,ISIS application forms surface in intelligence haul on terror group's recruits,Negative
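
With `axis=1`, the lambda receives one row at a time and picks out the column it needs. A minimal sketch on made-up titles, with a hypothetical `shout()` function standing in for the sentiment call, compares this with applying a function directly to a single column:

```python
import pandas as pd

def shout(text: str) -> str:
    """Hypothetical stand-in for a per-row calculation."""
    return text.upper()

df = pd.DataFrame({"title": ["hubble unveils stars", "zika outbreak"]})

# Row-wise: the lambda receives each row as a Series (axis=1)
df["by_row"] = df.apply(lambda row: shout(row["title"]), axis=1)

# Value-wise: Series.apply passes each value of one column directly
df["by_value"] = df["title"].apply(shout)

print(df["by_row"].tolist())  # ['HUBBLE UNVEILS STARS', 'ZIKA OUTBREAK']
```

The row-wise form is the one to reach for when the calculation needs more than one column.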


In [31]:
# PROMPT 4 get the sentiment for the summary Series in the episodes dataframe, name the column "summary_sentiment" and show the summary and the sentiment
epdf['summary_sentiment'] = epdf.apply(
    lambda row: get_text_sentiment(row['summary']),
    axis=1
)
epdf.head(1)

Unnamed: 0,id,url,name,season,number,type,airdate,airtime,airstamp,runtime,summary,rating.average,image.medium,image.original,_links.self.href,summary_sentiment
0,12907,https://www.tvmaze.com/episodes/12907/the-wire-1x01-the-target,The Target,1,1,regular,2002-06-02,21:00,2002-06-03T01:00:00+00:00,60,"<p>Homicide detective Jimmy McNulty observes the murder trial of a mid-level drug dealer, D'Angelo Barksdale, and sees the prosecution's star witness recant her testimony. McNulty recognises drug king-pin Stringer Bell in the court room and believes he has manipulated the proceedings, so he circumvents the chain-of-command by talking to the judge, Daniel Phelan, who then places pressure on the police department to investigate the Barksdale drug-dealing organization, which, McNulty claims, has gotten away with ten murders in the last year. D'Angelo is welcomed home by his uncle, Barksdale patriarch, Avon, who is frustrated with him for placing himself in a situation where the police could charge him. Nevertheless, Avon allows him to return to work, but in what D'Angelo sees as a demotion, he is moved to a low-rise housing project known as ""the pit."" Meanwhile, homeless drug addict Bubbles, acts as mentor to another addict, Johnny Weeks, in an ill-conceived scam with severe consequences.</p>",7.8,https://static.tvmaze.com/uploads/images/medium_landscape/94/236937.jpg,https://static.tvmaze.com/uploads/images/original_untouched/94/236937.jpg,https://api.tvmaze.com/episodes/12907,Negative


## Saving the dataframe 

Calculating the sentiment isn't free, and we don't need to do it more than once. It's a good idea to save the contents of our dataframe at this point. That way we can continue from the saved file instead of recalculating the transformation.

After you run this code open the `reddit_news_articles.csv` file and check it out!

In [None]:
art_df.to_csv("reddit_news_articles.csv", index=False)
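
Picking up from a saved file later is a matter of reading it back with `pd.read_csv()`. The sketch below round-trips a small made-up dataframe through a temporary folder; the file name `example.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up data standing in for the articles dataframe
df = pd.DataFrame({
    "title": ["a", "b"],
    "title_sentiment": ["Neutral", "Positive"],
})

# Write to a temporary folder so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.csv")
    df.to_csv(path, index=False)   # index=False keeps the row index out of the file
    restored = pd.read_csv(path)   # read the saved rows back into a dataframe

print(restored)
```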

In [33]:
# PROMPT 5 save your episodes of the wire to "the_wire_episodes.csv"
epdf.to_csv("the_wire_episodes.csv", header=True, index=False)

## Hot and Top articles.

You have been hired as a data scientist at Reddit. Through your extensive statistical research, you have determined: 

- Articles where the number of comments is in the 75th percentile should be categorized as `"Hot"`
- Articles where the number of upvotes is in the 75th percentile should be categorized as `"Popular"`
- Articles matching both should be categorized as `"Fire"`

In these next challenges, we will write code to implement these rules.


In [34]:
# PROMPT 6 write code to show the statistics of the numerical columns in the `art_df`
art_df.describe()

Unnamed: 0,data.gilded,data.score,data.num_comments,data.downs,data.created,data.created_utc,data.ups
count,25.0,25.0,25.0,25.0,25.0,25.0,25.0
mean,0.0,30.44,10.92,0.0,1458733000.0,1458704000.0,30.44
std,0.0,74.512348,22.597788,0.0,36881.08,36881.08,74.512348
min,0.0,0.0,0.0,0.0,1458655000.0,1458627000.0,0.0
25%,0.0,1.0,1.0,0.0,1458700000.0,1458671000.0,1.0
50%,0.0,8.0,2.0,0.0,1458727000.0,1458698000.0,8.0
75%,0.0,27.0,11.0,0.0,1458767000.0,1458738000.0,27.0
max,0.0,376.0,108.0,0.0,1458791000.0,1458762000.0,376.0
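
The `75%` row of `describe()` is the 0.75 quantile of each numeric column. A quick check on made-up values (not the Reddit data) shows that a single Series gives the same number either way:

```python
import pandas as pd

# Made-up upvote counts
scores = pd.Series([0, 1, 8, 27, 376], name="ups")

# describe() returns a Series indexed by statistic name
from_describe = scores.describe().loc["75%"]

# quantile() computes one percentile directly
from_quantile = scores.quantile(0.75)

print(from_describe, from_quantile)  # 27.0 27.0
```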


In [38]:
# PROMPT 7 which columns represent comments? upvotes? What are the thresholds based on the 75th percentile?
hot_comments_minimum = art_df.describe().loc["75%", "data.num_comments"]
pop_upvotes_minimum = art_df.describe().loc["75%", "data.ups"]

In [None]:
# PROMPT 8 Write code to extract the comments_threshold


In [None]:
# PROMPT 9 Write code to extract the upvote_threshold


### defining the `categorize()` function:

Let's review the requirements: 

- Articles where the number of comments is in the 75th percentile should be categorized as "Hot"
- Articles where the number of upvotes is in the 75th percentile should be categorized as "Popular"
- Articles matching both should be categorized as "Fire"
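
One possible reading of these rules is sketched below. The name `categorize_sketch` is hypothetical, the argument order follows the test cells further down, and two choices are assumptions rather than stated requirements: a value equal to the threshold counts as a match, and an article matching neither rule gets an empty string.

```python
def categorize_sketch(upvotes, upvote_threshold, comments, comments_threshold):
    """One possible reading of the rules; '>=' and the '' fallback are assumptions."""
    is_hot = comments >= comments_threshold
    is_popular = upvotes >= upvote_threshold
    # Check the combined case first so it is not shadowed by the single rules
    if is_hot and is_popular:
        return "Fire"
    if is_hot:
        return "Hot"
    if is_popular:
        return "Popular"
    return ""

print(categorize_sketch(30, 27, 2, 11))       # Popular
print(categorize_sketch(5, 27, 12, 11))       # Hot
print(categorize_sketch(30, 27, 12, 11))      # Fire
print(repr(categorize_sketch(5, 27, 2, 11)))  # ''
```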

INPUTS (4): 

    PROMPT 10
    
OUTPUTS (1):

    PROMPT 11

In [None]:
# PROMPT 12 write function definition


In [None]:
# PROMPT 13a write test for "Hot"
upvotes = 0
upvote_threshold = 0
comments = 0
comments_threshold = 0
expect = "????"
actual = categorize(upvotes, upvote_threshold, comments, comments_threshold)
print(f"When upvotes={upvotes}/{upvote_threshold}, comments={comments}/{comments_threshold}, EXPECT={expect}, ACTUAL={actual}")
assert expect == actual

In [None]:
# PROMPT 13b write test for "Popular"
upvotes = 0
upvote_threshold = 0
comments = 0
comments_threshold = 0
expect = "????"
actual = categorize(upvotes, upvote_threshold, comments, comments_threshold)
print(f"When upvotes={upvotes}/{upvote_threshold}, comments={comments}/{comments_threshold}, EXPECT={expect}, ACTUAL={actual}")
assert expect == actual

In [None]:
# PROMPT 13c write test for "Fire"
upvotes = 0
upvote_threshold = 0
comments = 0
comments_threshold = 0
expect = "????"
actual = categorize(upvotes, upvote_threshold, comments, comments_threshold)
print(f"When upvotes={upvotes}/{upvote_threshold}, comments={comments}/{comments_threshold}, EXPECT={expect}, ACTUAL={actual}")
assert expect == actual

In [None]:
# PROMPT 13d write test for ""
upvotes = 0
upvote_threshold = 0
comments = 0
comments_threshold = 0
expect = "????"
actual = categorize(upvotes, upvote_threshold, comments, comments_threshold)
print(f"When upvotes={upvotes}/{upvote_threshold}, comments={comments}/{comments_threshold}, EXPECT={expect}, ACTUAL={actual}")
assert expect == actual

## Final step. lambda / apply

Add a column to the dataframe called `"category"` that holds the category for each article.

Output the article title, the number of comments, the number of upvotes, and the category.
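
As a rough illustration of the pattern on made-up data, the sketch below computes both thresholds with `quantile(0.75)` and applies a labelling function row by row. The column names, the `label()` helper, and the `>=` comparison are all assumptions for this example, not the Reddit columns or a required solution.

```python
import pandas as pd

# Made-up articles; column names are hypothetical
df = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e"],
    "comments": [0, 1, 2, 11, 108],
    "ups": [376, 1, 8, 27, 0],
})

comments_min = df["comments"].quantile(0.75)  # 11.0
ups_min = df["ups"].quantile(0.75)            # 27.0

def label(ups, ups_min, comments, comments_min):
    """Hypothetical helper; '>=' is an assumption about the rule."""
    hot = comments >= comments_min
    popular = ups >= ups_min
    if hot and popular:
        return "Fire"
    return "Hot" if hot else "Popular" if popular else ""

# The lambda passes the row's values plus the two thresholds to the helper
df["category"] = df.apply(
    lambda row: label(row["ups"], ups_min, row["comments"], comments_min),
    axis=1,
)
print(df[["title", "comments", "ups", "category"]])
```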

In [None]:
# PROMPT 14


In [None]:
# run this code to turn in your work!
from casstools.assignment import Assignment
Assignment().submit()