# Chapter 4 - Scattertext

## Instructions

- Run the cells with "assert" statements to check whether your output matches the expected output. If a cell runs without error, your answer matches! If your output is different, the assertion message will give you a hint.

Uncomment and run the line of code below to install scattertext if you have not already done so.

In [1]:
#!pip install scattertext

In [2]:
import pandas as pd
import scattertext as st

Please perform calculations on this dataframe called `df` for the exercises in this chapter.  These data include [reviews for AirBnB properties](https://www.kaggle.com/broach/denverairbnb?select=reviews.csv) in the Denver, Colorado, USA, area.

In [3]:
df = pd.read_csv('https://github.com/kimfetti/Projects/blob/master/Etc/airbnb_reviews10K.csv?raw=True')

In [4]:
df['date'] = pd.to_datetime(df.date)

In [5]:
df['summer'] = df.date.apply(lambda x: x.month in [6, 7, 8]).map({True: "Summer", False: "Not Summer"})

In [6]:
df.sample(5, random_state=10)

| Unnamed: 0 | listing_id | id | date | reviewer_id | reviewer_name | comments | summer |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 937 | 15330451 | 512737596 | 2019-08-18 | 3023810 | Andi | Tim's house is gorgeous and perfect for our gr... | Summer |
| 9355 | 28881584 | 553656965 | 2019-10-24 | 138779484 | Alex | Great location, very clean, cozy, and reasonab... | Not Summer |
| 2293 | 9496966 | 484905930 | 2019-07-09 | 183541615 | Kate | Kinga's place was spacious with great views of... | Summer |
| 192 | 30007294 | 548473063 | 2019-10-17 | 259657102 | Joann | It was a great experience | Not Summer |
| 8675 | 13526515 | 488214568 | 2019-07-14 | 55695037 | James | Great location, easy check in and checkout, an... | Summer |


# Exercise 1

In the code above, we converted the "date" column to be datetime values and then created a column called "summer" that contains labels: "Summer" if the review was left during the summer months and "Not Summer" if not.
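The same labeling step can be written without a Python-level lambda. The sketch below is one possible alternative, not the notebook's own code: it uses a small made-up dataframe (`toy` and its dates are hypothetical, not part of the AirBnB data) and the pandas `.dt.month` accessor with `.isin` to build the same kind of "Summer" / "Not Summer" column.

```python
import pandas as pd

# Hypothetical toy data; these dates are made up for illustration.
toy = pd.DataFrame({"date": ["2019-08-18", "2019-10-24", "2019-07-09", "2019-01-02"]})

# Parse the strings into datetime values, as was done for df above.
toy["date"] = pd.to_datetime(toy["date"])

# June, July, and August count as summer; every other month does not.
is_summer = toy["date"].dt.month.isin([6, 7, 8])
toy["summer"] = is_summer.map({True: "Summer", False: "Not Summer"})

print(toy)
```

Both approaches should give the same labels; the vectorized form simply avoids calling a Python function once per row.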

Let's use scattertext to see if summertime reviews talk about different things than reviews left during the rest of the year.

## Part 1 - Build the Corpus

We've already loaded in scattertext and a pandas dataframe for you, so the next step is building our scattertext corpus.

We have provided you with some starter code below.  Fill in the names of the category column and the text column as strings.

In [7]:
## the starter code should look like this, but uncommented:

# corpus = st.CorpusFromPandas(
#     df,
#     category_col =  , ## YOUR ANSWER HERE
#     text_col = ,  ### YOUR ANSWER HERE
#     nlp=st.whitespace_nlp_with_sentences
# ).build()

In [8]:
### BEGIN SOLUTION

corpus = st.CorpusFromPandas(
    df,
    category_col = 'summer',
    text_col = 'comments',
    nlp=st.whitespace_nlp_with_sentences
).build()

### END SOLUTION

In [9]:
### CHECK YOUR OUTPUT WITH THE ANSWER
assert type(corpus) == st.CorpusDF, "Be sure to create your corpus from the dataframe provided.  It should be a scattertext CorpusDF object."
assert corpus.get_num_docs() == len(df), "Your corpus should be constructed from df and should have the same number of documents as df."
assert corpus.get_categories() == ['Not Summer', 'Summer'], "The categories of your corpus should be 'Not Summer' and 'Summer'."

## Part 2 - Create the HTML

And now we will create our scattertext HTML.  We have provided you with more starter code below.  You will need to fill in:

1. category: This is the string of the category that you would like to explore from the category column.  Since we are exploring the "summer" column, category will be either "Summer" or "Not Summer" -- your choice.  (By the way, if we had three or more categories, scattertext would group all of the other categories together.  It's basically this category vs. NOT this category.)

2. category_name:  This is another string.  What would you like to name your category axis?

3. not_category_name:  This is another string.  What would you like to name the other axis on your scattertext plot?  
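
The "this category vs. NOT this category" idea can be illustrated with plain pandas. This is only a sketch of the concept on made-up labels, not scattertext's internal implementation; the `toy` dataframe, the `season` column, and the `one_vs_rest` helper are hypothetical names introduced for this example.

```python
import pandas as pd

def one_vs_rest(labels: pd.Series, category: str) -> pd.Series:
    """Keep `category` as-is and relabel every other value as 'Not <category>'."""
    return labels.where(labels == category, "Not " + category)

# Hypothetical labels with more than two categories.
toy = pd.DataFrame({"season": ["Summer", "Fall", "Winter", "Summer", "Spring"]})

# Collapse the four seasons into "Summer" vs. everything else.
toy["summer_vs_rest"] = one_vs_rest(toy["season"], "Summer")

print(toy["summer_vs_rest"].value_counts())
```

Whichever label you pass as the category ends up on one axis, and everything else is pooled onto the other.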

In [10]:
# ### the starter code should look like this, but uncommented:

# html = st.produce_scattertext_explorer(
#         corpus,
#         category= , ### YOUR ANSWER HERE
#         category_name=  , ### YOUR ANSWER HERE
#         not_category_name= , ### YOUR ANSWER HERE
#         minimum_term_frequency=10,
#         pmi_threshold_coefficient=5,
#         width_in_pixels=1000,
#         metadata=df['reviewer_name'],
#         )

In [11]:
### BEGIN SOLUTION

html = st.produce_scattertext_explorer(
        corpus,
        category="Summer",
        category_name='Summer',
        not_category_name='Not Summer',
        minimum_term_frequency=10,
        pmi_threshold_coefficient=5,
        width_in_pixels=1000,
        metadata=df['reviewer_name'],
        )

### END SOLUTION

## Part 3 - Explore the HTML

Now it's time to save your HTML.  Execute the cell below, which writes an HTML file called "airbnb_reviews.html" to the directory you are currently working from.  If you are working in a hosted notebook environment, download this file to your computer.

In [12]:
# Write the HTML as UTF-8 bytes; the with-block closes the file when done
with open('airbnb_reviews.html', 'wb') as f:
    f.write(html.encode('utf-8'))
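
The saved file can also be opened from Python rather than from a file browser. The sketch below uses only the standard library; `html_file_uri` is a hypothetical helper name, and the example assumes the file sits in the current working directory and that your Python session runs on the same machine as your browser.

```python
import webbrowser
from pathlib import Path

def html_file_uri(filename: str) -> str:
    """Return a file:// URI for `filename`, resolved against the current working directory."""
    return Path(filename).resolve().as_uri()

uri = html_file_uri("airbnb_reviews.html")
print(uri)

# Uncomment to open the file in your default browser:
# webbrowser.open(uri)
```

The `webbrowser.open` call is left commented out so the cell has no side effects until you choose to run it.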

That's it!  You just created your first scattertext.  Now it's time to explore.  Open the file you just downloaded -- a simple double click should do the trick.  Give the file a few minutes to load into your browser and then answer the following questions:

1.  What is the top characteristic word for the "Summer" group (the first word in the summer list)?  Does this surprise you?

2. Now click on this top word and scroll down.  You should be able to see specific examples of this "word" used in the AirBnB reviews.

3. There were about half as many "Summer" reviews as "Not Summer" reviews.  How does scattertext seem to handle this?

4. Scattertext does allow word phrases to be considered terms.  Can you find at least one two-word term?  Any theories on what two-word pairs are considered terms and get their own scatter point?

5. What is the top word that is characteristic of this corpus but not as popular in a standard English corpus?  Does this surprise you?

6. Do more of the "Summer" or "Not Summer" reviews mention "snow"?  If you can't find this term, you can search for it in the search bar that says "Search this chart".

7. We passed `df["reviewer_name"]` to the metadata argument of `.produce_scattertext_explorer()`.  Where does this information show up on the scattertext plot?

8.  What other interesting insights can you find?

