# Now You Code In Class: 

## Tricks of The Pandas Masters Volume II

Once again, we will try something a bit different for our activity: a series of Pandas coding challenges!

Datasets we will use:

- Reddit Data: https://raw.githubusercontent.com/mafudge/datasets/master/json-samples/reddit.json
- Episodes of the HBO series "The Wire": https://raw.githubusercontent.com/mafudge/datasets/master/tv-shows/the-wire.json


In [3]:
import pandas as pd
import warnings

pd.set_option('display.max_colwidth', None)
warnings.filterwarnings('ignore')

## Deserializing JSON

The following function takes a URL as input and returns the deserialized JSON as output.


In [None]:
def get_json(url: str):
    import requests
    response = requests.get(url)
    response.raise_for_status()
    return response.json()


This example deserializes the Reddit news, finds the key that holds the articles, and displays the first article.

In [None]:
reddit = get_json("https://raw.githubusercontent.com/mafudge/datasets/master/json-samples/reddit.json")
articles = reddit["data"]["children"]
articles[0]
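
When the location of the records is not known in advance, inspecting the keys at each level is one way to find it. The sketch below uses a small hand-made JSON string, `raw`, that only imitates the nesting of the Reddit response; its values are made up for illustration and it needs no network access.

```python
import json

# Hypothetical miniature of a Reddit-style response (not real data).
raw = '{"kind": "Listing", "data": {"children": [{"kind": "t3", "data": {"title": "Example title"}}]}}'

sample = json.loads(raw)            # deserialize: JSON text -> Python dict
print(sample.keys())                # top-level keys
print(sample["data"].keys())        # keys one level down

first = sample["data"]["children"][0]   # "children" is a list of dicts
print(first["data"]["title"])
```

The same step-by-step inspection should work on any nested JSON document, although the key names will differ.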

In [None]:
# PROMPT 1 read in the episodes of "The Wire", find the episodes key, and show the first episode.


## Creating a dataframe

In this example we use `json_normalize()` to display the articles in a dataframe.

In [None]:
art_df = pd.json_normalize(articles) #articles is a list of dict!
art_df.head(1)
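
To see where column names such as `data.title` come from, here is a minimal sketch of `json_normalize()` on a tiny list of dicts. The `records` list is hypothetical and only mimics the shape of the Reddit articles.

```python
import pandas as pd

# Hypothetical records shaped like the Reddit articles (values are made up).
records = [
    {"kind": "t3", "data": {"title": "First", "ups": 10}},
    {"kind": "t3", "data": {"title": "Second", "ups": 25}},
]

flat_df = pd.json_normalize(records)
print(list(flat_df.columns))  # nested keys are joined with "."
```

Each nested key becomes its own column, with the parent and child key names joined by a dot.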

In [None]:
# PROMPT 2 display the episodes as a dataframe


## A Text Sentiment Web Service

The following function uses the http://text-processing.com service to calculate the sentiment for any input text. The result is a dict with probabilities and overall sentiment. For example:

INPUT: `"very nice"`

OUTPUT: 
```
{
    'probability': {
        'neg': 0.28997418956645504,
        'neutral': 0.13591527211268692,
        'pos': 0.710025810433545
    },
    'label': 'pos'
}
```


In [1]:
def get_text_sentiment(text: str) -> dict:
    import requests
    response = requests.post(
        "http://text-processing.com/api/sentiment/", 
        data={"text": text},
        headers={"User-Agent": "Mozilla/5.0"}
    )
    response.raise_for_status()
    return response.json()


sentiment = get_text_sentiment("very nice")
sentiment

{'probability': {'neg': 0.28997418956645504,
  'neutral': 0.13591527211268692,
  'pos': 0.710025810433545},
 'label': 'pos'}

In [2]:
# PROMPT 3: get the sentiment of the text below and print the label only
text = "The layover for my flight was very long and it was cold on the plane and I had a stomach ache"


{'probability': {'neg': 0.6701191630651346,
  'neutral': 0.0828566177589778,
  'pos': 0.3298808369348654},
 'label': 'neg'}

## Using a lambda with `apply()`

The following example adds a new Series to the dataframe called `"title_sentiment"`, which holds the sentiment of each Reddit news title.

In [None]:
art_df["title_sentiment"] = art_df.apply(lambda row: get_text_sentiment(row["data.title"]), axis=1)
art_df[["data.title", "title_sentiment"]].head()
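
The pattern can be tried offline first. The sketch below swaps the web service for `fake_sentiment()`, a hypothetical stand-in that is not a real sentiment model, and then uses a second `apply()` to pull one value out of each resulting dict.

```python
import pandas as pd

def fake_sentiment(text: str) -> dict:
    """Hypothetical offline stand-in for get_text_sentiment(); not a real model."""
    label = "pos" if "good" in text.lower() else "neg"
    return {"label": label}

demo_df = pd.DataFrame({"data.title": ["Good news today", "Storm closes roads"]})

# axis=1 passes each row to the lambda; the result is one dict per row.
demo_df["title_sentiment"] = demo_df.apply(
    lambda row: fake_sentiment(row["data.title"]), axis=1
)

# A second apply on the new Series extracts a single value from each dict.
demo_df["title_label"] = demo_df["title_sentiment"].apply(lambda s: s["label"])
print(demo_df[["data.title", "title_label"]])
```

Testing with a stand-in like this avoids calling the service once per row while the code is still being debugged.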

In [None]:
# PROMPT 4 get the sentiment for the summary Series in the episodes dataframe, name the column "summary_sentiment" and show the summary and the sentiment


## Saving the dataframe 

Calculating the sentiment isn't free, and we don't need to do it more than once. It's a good idea to save the contents of our dataframe at this point. That way we can continue from the saved file instead of recalculating the transformation.

After you run this code open the `reddit_news_articles.csv` file and check it out!

In [None]:
art_df.to_csv("reddit_news_articles.csv", index=False)
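
One thing to watch for when continuing from the saved file: CSV stores everything as text, so a column of dicts comes back as strings. The sketch below illustrates this with a made-up one-row dataframe and an in-memory buffer standing in for the file; parsing with `ast.literal_eval` is one possible way to restore the dicts, not the only one.

```python
import ast
import io
import pandas as pd

saved_df = pd.DataFrame({
    "data.title": ["Example title"],
    "title_sentiment": [{"label": "pos"}],   # made-up value
})

buffer = io.StringIO()                # in-memory stand-in for a .csv file
saved_df.to_csv(buffer, index=False)
buffer.seek(0)

loaded_df = pd.read_csv(buffer)
print(type(loaded_df.loc[0, "title_sentiment"]))   # a str, no longer a dict

# ast.literal_eval parses the text back into a Python dict.
loaded_df["title_sentiment"] = loaded_df["title_sentiment"].apply(ast.literal_eval)
print(loaded_df.loc[0, "title_sentiment"]["label"])
```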

In [None]:
# PROMPT 5 save your episodes of the wire to "the_wire_episodes.csv"


## Hot and Popular articles

You have been hired as a data scientist at Reddit. Through your extensive statistical research, you have determined: 

- Articles where the number of comments is at or above the 75th percentile should be categorized as `"Hot"`
- Articles where the number of upvotes is at or above the 75th percentile should be categorized as `"Popular"`
- Articles matching both should be categorized as `"Fire"`

In these next challenges, we will write code to implement these rules.


In [None]:
# PROMPT 6 write code to show the statistics of the numerical columns in the `art_df`
art_df.describe()

In [None]:
# PROMPT 7 which columns represent comments? upvotes? What are the thresholds based on the 75% percentile?
hot_comments_minimum = art_df.describe().loc["75%", "data.num_comments"]
pop_upvotes_minimum = art_df.describe().loc["75%", "data.ups"]
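
As a cross-check, the `"75%"` row of `describe()` should match what `quantile(0.75)` returns for the same column. A minimal sketch on a made-up Series of comment counts:

```python
import pandas as pd

counts = pd.Series([1, 2, 3, 4, 5])   # made-up comment counts

from_describe = counts.describe()["75%"]   # the 75% row of the summary
from_quantile = counts.quantile(0.75)      # the same statistic, computed directly
print(from_describe, from_quantile)
```

Using `quantile()` directly avoids computing the full summary when only one threshold is needed.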


In [None]:
# PROMPT 8 Write code to extract the comments_threshold


In [None]:
# PROMPT 9 Write code to extract the upvote_threshold


### Defining the `categorize()` function

Let's review the requirements: 

- Articles where the number of comments is at or above the 75th percentile should be categorized as "Hot"
- Articles where the number of upvotes is at or above the 75th percentile should be categorized as "Popular"
- Articles matching both should be categorized as "Fire"

INPUTS (4): 

    PROMPT 10
    
OUTPUTS (1):

    PROMPT 11

In [None]:
# PROMPT 12 write function definition


In [None]:
# PROMPT 13a write test for "Hot"
upvotes = 0
upvote_threshold = 0
comments = 0
comments_threshold = 0
expect = "????"
actual = categorize(upvotes, upvote_threshold, comments, comments_threshold)
print(f"When upvotes={upvotes}/{upvote_threshold}, comments={comments}/{comments_threshold}, EXPECT={expect}, ACTUAL={actual}")
assert expect == actual

In [None]:
# PROMPT 13b write test for "Popular"
upvotes = 0
upvote_threshold = 0
comments = 0
comments_threshold = 0
expect = "????"
actual = categorize(upvotes, upvote_threshold, comments, comments_threshold)
print(f"When upvotes={upvotes}/{upvote_threshold}, comments={comments}/{comments_threshold}, EXPECT={expect}, ACTUAL={actual}")
assert expect == actual

In [None]:
# PROMPT 13c write test for "Fire"
upvotes = 0
upvote_threshold = 0
comments = 0
comments_threshold = 0
expect = "????"
actual = categorize(upvotes, upvote_threshold, comments, comments_threshold)
print(f"When upvotes={upvotes}/{upvote_threshold}, comments={comments}/{comments_threshold}, EXPECT={expect}, ACTUAL={actual}")
assert expect == actual

In [None]:
# PROMPT 13d write test for ""
upvotes = 0
upvote_threshold = 0
comments = 0
comments_threshold = 0
expect = "????"
actual = categorize(upvotes, upvote_threshold, comments, comments_threshold)
print(f"When upvotes={upvotes}/{upvote_threshold}, comments={comments}/{comments_threshold}, EXPECT={expect}, ACTUAL={actual}")
assert expect == actual

## Final step. lambda / apply

Add a column to the dataframe called `"category"`, which holds the category calculated for each article.

Output the article title, the number of comments, the number of upvotes, and the category.

In [None]:
# PROMPT 14
