# INFO 2950 Homework 1

This week, our goal is to (re-)familiarize ourselves with the basics of Python, and start getting to know the `pandas` package, which helps us work with data frames. We will also discuss cleaning data. 

## Part 1: Discussion Question

As a data scientist, you'll be asked to build various tools and systems all the time. While you might be *capable* of performing these tasks, it's always worth asking yourself: should you?

Below is a list of scenarios. In each case, answer the discussion question: **If you were to build this system, how might its use affect people (be it positively or negatively)?**

### Scenarios
1. You work for an e-commerce site that wants to direct customers to items they would likely be interested in and avoid serving them irrelevant ads. You've been asked to build an advertisement system that recommends products to customers based on a variety of attributes, including their age, income level, locale, gender, etc.
1. You work for a news site that lives or dies from ad revenue. Advertisers don't want their ads next to "controversial" news items. You've been asked to build a classifier for "touchy subjects".
1. You work for a public health agency that wants to identify people exposed to a disease for testing. You've been asked to build a system that will trace phone locations and estimate proximity between individuals to infer exposure.
1. You work for a university that wants to identify students having difficulties with stress and mental health. You've been asked to build a system that will identify students who aren't going to class.


---

## Problem 1 (6 pts)
*Remember not to move or modify problem header cells, like this one; we use them for automated homework analysis.*

Choose **two** of the above scenarios, and in each case, answer the discussion question.

---

**Scenario 3:** This system will most prominently affect the people whose phones are tracked. If users do not give full and informed consent to having their location tracked 24/7, then the system seriously violates their privacy. This is especially true if people's location data is stored and packaged on the system and can be sold to other companies. However, the technology could also benefit the people whose location is tracked, since they will be notified if they are a close contact and may then seek testing that they would not have sought otherwise. This could improve public health and benefit the entire community by limiting the spread of the disease.

**Scenario 1:** This system could benefit customers who receive tailored ads, since they may discover relevant products that they would not have found otherwise. However, as with the previous scenario, if users' sensitive data, such as age and income level, is stored and can be packaged and resold to third parties, people could be harmed because their data is being used without their knowledge or consent. The system could also benefit the employees of the e-commerce site, since sales revenue could increase substantially if the system shows customers the products they are most likely to purchase.

---

## Part 2: Python Basics

Twitter is full of controversies, but a recent one over whether or not 2 + 2 = 5 sparked a lot of discussion in the math and stats world. Here's [a great thread](https://twitter.com/kareem_carr/status/1289724475609501697) on this issue.

A key point from this thread: 

_"Our numbers, our quantitative measures, are abstractions of real underlying things in the universe and it's important to keep track of this when we use numbers to model the real world."_

So what does this Twitter controversy have to do with Python...? Let's try to compute 2 + 2 in a few ways and see what happens.


In [2]:
## example 1
x = 2
print("2 + 2 =", x + x)

## example 2
y = 2.0
print("2 + 2 =", y + y)

## example 3
z = "2"
print("2 + 2 =", z + z)

2 + 2 = 4
2 + 2 = 4.0
2 + 2 = 22


2 + 2 = ... 22? What's going on here?

---

## Problem 2 (4 pts)

What is the difference between each of the three examples of arithmetic above? In particular, explain why example 3 produced "2 + 2 = 22". 

---

In the first example, `x` is an integer, so Python performs integer addition and returns the integer `4`. In example 2, `y` is a float, so Python performs floating-point addition and returns the float `4.0`. In example 3, `z` is a string, so `+` performs string concatenation: Python joins the two strings end to end instead of performing arithmetic, which gives `22`.
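
One way to check this explanation is to ask Python for the type of each variable. The sketch below reuses `x`, `y`, and `z` from the examples above; the `int(z)` conversion and the name `z_sum` are additions for illustration and are not part of the original examples.

```python
x = 2      # int
y = 2.0    # float
z = "2"    # str

# type() reports how Python is storing each value
print(type(x).__name__, type(y).__name__, type(z).__name__)

# + means addition for numbers but concatenation for strings
print(x + x, y + y, z + z)

# converting the string to an int first restores ordinary arithmetic
z_sum = int(z) + int(z)
print(z_sum)
```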

---

We can perform any number of simple arithmetic operations with Python, but its power lies in our ability to automate simple tasks with a few lines of code. In doing so, for loops are often very useful.

---

## Problem 3 (6 pts)

Start with the list of integers from 0 to 15 (inclusive). Write a for loop that iterates through this list to create a new list, where three has been added to each integer in the original list. Print the final list.

*(**Hint:** use the [`range`](https://python-reference.readthedocs.io/en/latest/docs/functions/range.html?highlight=range) function to iterate over in your for loop.)*

*(**Confidence check:** your code should print `[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]`.)* 

---

In [3]:
x = list(range(16))  # integers 0 through 15
y = []
for i in x:
    y.append(i + 3)  # add three to each integer
print(y)

[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]


---

For loops are very literal, which makes them straightforward to understand and write. However, they can get a little verbose and they are not always the most computationally efficient.

List comprehensions are a sleek alternative that you will (hopefully) come to love in this course. Their syntax is designed to be simple. It looks like this:

`[`expression `for` item `in` list`]`

For instance, the for loop you wrote in problem 3 can be written succinctly with a list comprehension:

In [4]:
new_list_of_ints = [i+3 for i in range(16)]
print(new_list_of_ints)

[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]


You can also include `if` conditions on the original list in the list comprehension...
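
As a small illustration, the general pattern is `[expression for item in list if condition]`: only items for which the condition is true are passed to the expression. The sketch below uses a made-up condition and a made-up name (`even_squares`), chosen to differ from the problem that follows.

```python
# keep only the even numbers from 0 to 9, then square each one
even_squares = [n * n for n in range(10) if n % 2 == 0]
print(even_squares)  # [0, 4, 16, 36, 64]
```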

---

## Problem 4 (4 pts)

Copy the above list comprehension and add an if statement (within the list comprehension) after `range(16)` so that the list it creates includes only integers that leave a remainder of 2 when divided by 3. The `i+3 for i in range(16)` part of the list comprehension should remain completely unmodified in your solution.   

*(**Hint:** You may want to make use of the [modulus](https://python-reference.readthedocs.io/en/latest/docs/operators/modulus.html) operator)*

*(**Confidence check:** your code should print `[5, 8, 11, 14, 17]`.)* 

---

In [5]:
new_list_of_ints = [i+3 for i in range(16) if i%3==2]
print(new_list_of_ints)

[5, 8, 11, 14, 17]


---

## Part 3: `pandas` and Data Cleaning

The [`pandas`](https://pandas.pydata.org/) package is used to perform data analysis. It is essentially a package for working with data frames (i.e. tables of data).

In [6]:
import pandas as pd

We could simply write `import pandas`, but then every time we want to refer to a function in the `pandas` package (e.g. one called `my_function`), we would have to always write `pandas.my_function` to refer to it, and this gets cumbersome. A common way to abbreviate "pandas" is "pd", which is why we imported it the way we did above. This way, we can refer to functions in the `pandas` package like `pd.my_function` instead.

Real-world data is often messy and needs a bit of pre-processing or "cleaning" before it can be used (easily) for analysis. There is no well-defined list of steps to go through when cleaning data, because it depends on what your raw data looks like to begin with.

The data we'll work with now is a set of makeup reviews from the Sephora online store. The original data can be found [here](https://github.com/everestpipkin/datagardens/tree/master/students/khanniie/5_newDataSet), though we've manually altered the data to help illustrate some data cleaning principles.

### Data cleaning plan

Before we roll up our sleeves and get to cleaning the data, we need to devise a plan:

**Step 1:** *Figure out the start point* - what does our (raw) data look like now?

**Step 2:** *Set analysis goals* - what kind of analyses might we want to run based on the available data?

**Step 3:** *Define the end point* - what form do we want the data in for analysis?

**Step 4:** *Make a task list* - what do we need to do to get our data from the start to the end point?

Repeat these steps as you come up with more analysis ideas! It won't necessarily happen linearly.

Whenever I'm cleaning data, I like to have the `pandas` [cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) open, because I need to reference it embarrassingly often :)

---

## Problem 5 (4 pts)

Perform step 1 in the data cleaning plan by loading the data file (`all_reviews.csv`) using the `pandas` `read_csv` function. Save the data to a variable called `raw_reviews` and display the first few rows of the data using the [`.head()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) data frame method.

(*Remember:* The syntax for object-specific methods looks like `variable_name.method_name()`.)

---

In [7]:
raw_reviews=pd.read_csv('all_reviews.csv')
raw_reviews.head()

Unnamed: 0,Brand,Product Name,Product Type,Review Body,Review Date,Review Stars,Review Title,URL,User ID
0,Dior,Diorshow Waterproof Mascara,Mascara,This is my first Sephora product and was recom...,01 01 2011,5 stars,Favorite! :),https://www.sephora.com/product/diorshow-water...,6.19e+16
1,Kat Von D,Tattoo Liner,Liquid Eyeliner,What can I say. I bought it in Dec. but havent...,01 01 2011,5 stars,Best Yet,https://www.sephora.com/product/tattoo-liner-P...,-8.33e+17
2,Urban Decay,Cannonball Ultra Waterproof Mascara,Mascara,Pros: Goes on easy. Looks good. Waterproof. Co...,01 01 2012,1 star,Failed,https://www.sephora.com/product/cannonball-ult...,-3.34e+18
3,Urban Decay,Eyeshadow Primer Potion - Original,Eye Primer,I can't believe how I had to suffer years of g...,01 01 2012,5 stars,Absolute Must Have,https://www.sephora.com/product/eyeshadow-prim...,1.08e+17
4,Urban Decay,All Nighter Long-Lasting Makeup Setting Spray,Setting Spray,Worth it. Keeps makeup looking freshly applied...,01 01 2013,5 stars,Perfect setting spray,https://www.sephora.com/product/all-nighter-lo...,-9.21e+18


---

(This data is already a lot cleaner than you might see in practice; we want to start slow!)

**Step 2:** Based on the data that's available, here are some concrete analysis goals (to start!) for this data set:

1. compare ratings for products of the same type
1. look at the trend in ratings over time for a specific product

**Step 3:** So what do we need the clean data to look like for the above analysis? Let's sketch the header for our dream table:

|product_type|brand|product_name|review_stars|review_date|
|------------|-----|------------|------------|-----------|
|str|str|str|int|date|

You'll notice we've written our desired column names as lowercase strings with underscores instead of spaces. Lowercase is easier to type when programming, and spaces in column names prevent us from using some syntax to quickly refer to columns. 
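
To make the point about spaces concrete, here is a minimal sketch using a toy table. The data frame `toy` and its values are hypothetical and exist only to contrast the two ways of referring to a column.

```python
import pandas as pd

# toy table with hypothetical values, only to illustrate column access
toy = pd.DataFrame({
    "Product Name": ["Tattoo Liner"],
    "product_type": ["liquid eyeliner"],
})

# a column name containing a space only works with bracket syntax
with_space = toy["Product Name"]

# a lowercase name with underscores also works with the shorter dot syntax
with_dot = toy.product_type

print(with_space[0], "|", with_dot[0])
```

Bracket syntax works for any column name, so the dot syntax is a convenience rather than a requirement.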

**Step 4:** Now how do we get from the raw data to clean data? It helps to split the tasks into two parts: cleaning column names and cleaning entries.

Let's first clean up the column names and then take a look at the entries to see what needs to be done there. 

### Task List 1: Column names
1. convert column names to lowercase and replace spaces with underscores
1. drop unnecessary columns

---

## Problem 6 (4 pts)

Use a list comprehension to create a list called `new_colnames` that takes the old column names (stored in `raw_reviews.columns`) and converts them to lowercase. Print `new_colnames`.

*(**Hint:** You may find the string method [`.lower()`](https://www.w3schools.com/python/ref_string_lower.asp) helpful here)*

---

In [8]:
new_colnames = [i.lower() for i in raw_reviews.columns]
print(new_colnames)

['brand', 'product name', 'product type', 'review body', 'review date', 'review stars', 'review title', 'url', 'user id']


---

## Problem 7 (8 pts)

Use another list comprehension to update the `new_colnames` list, replacing all spaces in the list entries with underscores. Print the updated `new_colnames` list.

Copy `raw_reviews` to a new variable called `reviews` using the [`.copy()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html) data frame method. Update the column names for the `reviews` data frame using your cleaned column names stored in `new_colnames`. Show the first few rows of `reviews`.

*(**Hint:** you may find the string method [`.replace()`](https://www.w3schools.com/python/ref_string_replace.asp) helpful here)*

---

In [9]:
# replace spaces with underscores in the lowercase column names
new_colnames = [i.replace(' ', '_') for i in new_colnames]
print(new_colnames)

# copy the raw data and apply the cleaned column names
reviews = raw_reviews.copy()
reviews.columns = new_colnames
reviews.head()

['brand', 'product_name', 'product_type', 'review_body', 'review_date', 'review_stars', 'review_title', 'url', 'user_id']


Unnamed: 0,brand,product_name,product_type,review_body,review_date,review_stars,review_title,url,user_id
0,Dior,Diorshow Waterproof Mascara,Mascara,This is my first Sephora product and was recom...,01 01 2011,5 stars,Favorite! :),https://www.sephora.com/product/diorshow-water...,6.19e+16
1,Kat Von D,Tattoo Liner,Liquid Eyeliner,What can I say. I bought it in Dec. but havent...,01 01 2011,5 stars,Best Yet,https://www.sephora.com/product/tattoo-liner-P...,-8.33e+17
2,Urban Decay,Cannonball Ultra Waterproof Mascara,Mascara,Pros: Goes on easy. Looks good. Waterproof. Co...,01 01 2012,1 star,Failed,https://www.sephora.com/product/cannonball-ult...,-3.34e+18
3,Urban Decay,Eyeshadow Primer Potion - Original,Eye Primer,I can't believe how I had to suffer years of g...,01 01 2012,5 stars,Absolute Must Have,https://www.sephora.com/product/eyeshadow-prim...,1.08e+17
4,Urban Decay,All Nighter Long-Lasting Makeup Setting Spray,Setting Spray,Worth it. Keeps makeup looking freshly applied...,01 01 2013,5 stars,Perfect setting spray,https://www.sephora.com/product/all-nighter-lo...,-9.21e+18


---

## Problem 8 (4 pts)

Update the `reviews` data frame, keeping only the following columns:

```
product_type, brand, product_name, review_stars, review_date
```

Show the first few rows of the updated `reviews` data frame.

---

In [10]:
reviews=reviews.drop(['review_body', 'url', 'user_id', 'review_title'], axis=1)
reviews.head()

Unnamed: 0,brand,product_name,product_type,review_date,review_stars
0,Dior,Diorshow Waterproof Mascara,Mascara,01 01 2011,5 stars
1,Kat Von D,Tattoo Liner,Liquid Eyeliner,01 01 2011,5 stars
2,Urban Decay,Cannonball Ultra Waterproof Mascara,Mascara,01 01 2012,1 star
3,Urban Decay,Eyeshadow Primer Potion - Original,Eye Primer,01 01 2012,5 stars
4,Urban Decay,All Nighter Long-Lasting Makeup Setting Spray,Setting Spray,01 01 2013,5 stars


---

Now that our column names are in order, let's look at what kind of cleaning our entries need.

---
## Problem 9 (6 pts)

Write a for loop that will, for __each column__ in `reviews`,
1. print the column name
1. print the count of observations for each type in the given column
1. print an empty line break (`"\n"`) to make the results produced by the for loop more readable.

*(**Hint:** For step 2 in the for loop, you may find the `pandas` column method [`.value_counts()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html?highlight=value_counts#pandas.Series.value_counts) helpful.)*

---

In [11]:
for col in reviews:
    print(col)                          # column name
    print(reviews[col].value_counts())  # count of each distinct entry
    print("\n")                         # blank line between columns
    

brand
Urban Decay                1033
Marc Jacobs                 584
Kat Von D                   567
FENTY BEAUTY BY RIHANNA     529
Stila                       418
Dior                        407
Too Faced                   402
Tom Ford                     26
Name: brand, dtype: int64


product_name
Highliner Gel Eye Crayon Eyeliner                584
Tattoo Liner                                     567
Flyliner Longwear Liquid Eyeliner                529
Stay All Day® Waterproof Liquid Eye Liner        418
Diorshow Waterproof Mascara                      407
Hangover Replenishing Face Primer                402
All Nighter Long-Lasting Makeup Setting Spray    391
Cannonball Ultra Waterproof Mascara              323
Eyeshadow Primer Potion - Original               319
Emotionproof Mascara                              26
Name: product_name, dtype: int64


product_type
Liquid Eyeliner    1514
Mascara             752
Gel Eyeliner        584
Primer              401
Setting Spray       391

---

## Problem 10 (4 pts)

Look at the results from the previous exercise for the `product_type` column. What do you notice about some of the entries that might be a problem for data analysis? What problem would this pose? 

---

Although they refer to the same product type, some entries differ only in capitalization (e.g. `primer` and `Primer` refer to the same product type, but Python won't recognize them as belonging to the same product type category). This would be a problem for data analysis because any analysis of the primer and mascara products would leave out some entries. For example, if we wanted to count the number of primer entries, we would get an inaccurate number.

---

Inspecting the entries, we should notice a few things that may be a problem for our analysis plan:
* some entries in the `product_type` column are inconsistent (e.g. `Mascara` and `MASCARA`)
* there appear to be some observations with missing values in the `review_stars` column (some observations have `???` in this column)
* the dates don't really look like dates and may not be parsed correctly
    
Using these observations, we can write our second task list for cleaning the table entries.

### Task List 2: Entries
1. convert entries in the `product_type` column to lowercase, to get rid of duplicate categories arising from case differences
1. remove observations with missing values in the `review_stars` column
1. convert the `review_date` column to an actual `pandas` date object
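
To see why the first task matters, the sketch below counts categories in a toy column before and after lowercasing. The entries in `types` are hypothetical and are not taken from the real data file.

```python
import pandas as pd

# hypothetical entries, not taken from the real data file
types = pd.Series(["Mascara", "MASCARA", "mascara", "Primer"])

# case variants are counted as separate categories
n_before = len(types.value_counts())

# after lowercasing, the variants collapse into one category
n_after = len(types.str.lower().value_counts())

print(n_before, n_after)
```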

---

## Problem 11 (4 pts)

Update `reviews` by converting the `product_type` column to lowercase. Display the first few rows of the updated `reviews`.

When updating a subset of a `pandas` dataframe, be sure to use the [`.loc[]`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html?highlight=loc#pandas.DataFrame.loc) data frame method on the left-hand side of the assignment statement. You should **not** get a warning labelled `SettingWithCopyWarning` if you've done this correctly.

---

In [12]:
# lowercase the whole column at once with the vectorized .str accessor
reviews.loc[:, 'product_type'] = reviews['product_type'].str.lower()
reviews.head()


Unnamed: 0,brand,product_name,product_type,review_date,review_stars
0,Dior,Diorshow Waterproof Mascara,mascara,01 01 2011,5 stars
1,Kat Von D,Tattoo Liner,liquid eyeliner,01 01 2011,5 stars
2,Urban Decay,Cannonball Ultra Waterproof Mascara,mascara,01 01 2012,1 star
3,Urban Decay,Eyeshadow Primer Potion - Original,eye primer,01 01 2012,5 stars
4,Urban Decay,All Nighter Long-Lasting Makeup Setting Spray,setting spray,01 01 2013,5 stars


---

## Problem 12 (6 pts)

Update `reviews` by removing any rows where the `review_stars` entry is `???`. Display the first few rows of the updated `reviews`.

When updating an entire `pandas` dataframe, be sure to use the [`.copy()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html?highlight=copy#pandas.DataFrame.copy) data frame method on the right-hand side of the assignment statement.

*(**Hint:** the `pandas` [cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) gives syntax for subsetting observations, which may be helpful here. We want to filter out any rows where the `review_stars` entry is `???`.)*

*(**Confidence check:** Before filtering out rows with `???` in the `review_stars` column, `reviews` should have 3966 observations. After filtering, `reviews` should have 3964 observations. You can check the number of rows by printing the `.shape` attribute of a dataframe, i.e. using `reviews.shape` here.)*

---

In [17]:
# keep rows whose rating is not '???'; .copy() makes the result an independent data frame
reviews = reviews[reviews['review_stars'] != '???'].copy()
reviews.head()

Unnamed: 0,brand,product_name,product_type,review_date,review_stars
0,Dior,Diorshow Waterproof Mascara,mascara,2011-01-01,5 stars
1,Kat Von D,Tattoo Liner,liquid eyeliner,2011-01-01,5 stars
2,Urban Decay,Cannonball Ultra Waterproof Mascara,mascara,2012-01-01,1 star
3,Urban Decay,Eyeshadow Primer Potion - Original,eye primer,2012-01-01,5 stars
4,Urban Decay,All Nighter Long-Lasting Makeup Setting Spray,setting spray,2013-01-01,5 stars


---

Note that often missing values will appear as null values (as opposed to something like `???`). Null values depend on the underlying data type: `NaN` in numeric arrays, `None` or `NaN` in object arrays, `NaT` in datetime arrays. The functions [`isnull()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isnull.html) and [`notnull()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notnull.html?highlight=notnull#pandas.DataFrame.notnull) are helpful for filtering in these cases.  You should **not** use logical operators like `==` or `!=` to check for null values, as such expressions often have [unintended consequences](https://towardsdatascience.com/navigating-the-hell-of-nans-in-python-71b12558895b).
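
A minimal sketch of why `==` fails for null values, using a toy column. The series `stars` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# hypothetical ratings with one missing value
stars = pd.Series([5.0, np.nan, 1.0])

# NaN is not equal to anything, including itself, so == finds nothing
found_with_eq = (stars == np.nan).sum()

# isnull() flags the missing entry correctly
found_with_isnull = stars.isnull().sum()

# notnull() keeps only the rows that have a value
kept = stars[stars.notnull()]

print(found_with_eq, found_with_isnull, len(kept))
```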

That said, it is worth asking yourself why you're removing rows with missing values and if it might affect some downstream analysis. For example, if we were interested in the *number* of reviews per product, removing reviews with missing ratings would artificially reduce these sample sizes.

Now, we just have to tackle the dates and we will have accomplished all of our data cleaning tasks!

Let's take a closer look at the dates, as they currently appear in `reviews`:

In [14]:
type(reviews.review_date[0])

str

The review dates are currently stored as strings in `reviews`. Python does not innately distinguish a date string from any other string (functionally, it thinks of "2020-09-07" and "hello" as the same type of data). When a date is stored as a string, no information is encoded about how dates should be read or ordered, or how to perform calculations with them.

Instead, we must represent dates as specific "date" types in Python. To do this, we have to tell Python how to parse the date string to create a special date object. 
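
The difference shows up as soon as we try to sort. The sketch below uses two hypothetical dates written in the `dd mm yyyy` style; the names `as_strings`, `as_dates`, and `gap_days` are made up for this illustration.

```python
import pandas as pd

# two hypothetical dates written as "dd mm yyyy" strings
as_strings = pd.Series(["02 01 2021", "15 06 2019"])

# sorted as text, "02 ..." comes before "15 ...", ignoring the year
text_order = as_strings.sort_values().tolist()

# once parsed, the same values sort chronologically
as_dates = pd.to_datetime(as_strings, format="%d %m %Y")
date_order = as_dates.sort_values().dt.year.tolist()

# date arithmetic also becomes possible
gap_days = (as_dates.max() - as_dates.min()).days

print(text_order, date_order, gap_days)
```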

---

## Problem 13 (6 pts)

Update `reviews` by overwriting the existing string entries in the `review_date` column with correctly parsed date entries. Use the `pandas` [`to_datetime()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) function with the `format` argument to convert the dates in the existing `review_date` column to correctly parsed ones.

Display the first few rows of the updated `reviews`. Print the data type of the first entry of the updated `review_date` column.

*(**Hint:** The original dates in the `review_date` column are of the form `dd mm yyyy`. In writing the `format` string that parses this formatting, you may find it helpful to consult [this chart](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes) of date and time codes recognized by `format`. For example, `format = "%b %d, %y"` would capture dates like `Jul 15, 91`.)*

---

In [15]:
reviews['review_date']=pd.to_datetime(reviews['review_date'],format="%d %m %Y")
display(reviews.head())
print(type(reviews.review_date[0]))


Unnamed: 0,brand,product_name,product_type,review_date,review_stars
0,Dior,Diorshow Waterproof Mascara,mascara,2011-01-01,5 stars
1,Kat Von D,Tattoo Liner,liquid eyeliner,2011-01-01,5 stars
2,Urban Decay,Cannonball Ultra Waterproof Mascara,mascara,2012-01-01,1 star
3,Urban Decay,Eyeshadow Primer Potion - Original,eye primer,2012-01-01,5 stars
4,Urban Decay,All Nighter Long-Lasting Makeup Setting Spray,setting spray,2013-01-01,5 stars


<class 'pandas._libs.tslibs.timestamps.Timestamp'>


---

That's it! We've accomplished our data cleaning task list. Of course, as you begin to work with your data, you may notice other idiosyncrasies that require further cleaning.

For instance, the entries in the `review_stars` column are still strings, but in order to perform certain kinds of analyses (e.g. get the mean star rating per item), we really want this data to be stored as a numeric type... How might one use Python to extract the numeric value of the rating? More on this later, when we start learning about [regular expressions](https://www.programiz.com/python-programming/regex)... 
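
As a preview, here is a rough sketch of two possible approaches on a toy column. This is an illustration under assumptions, not the method the course will necessarily use: the series `stars_text` is hypothetical, and both approaches assume every entry begins with the numeric rating.

```python
import pandas as pd

# hypothetical entries in the same style as the review_stars column
stars_text = pd.Series(["5 stars", "1 star", "4 stars"])

# one simple option: take the text before the first space and convert it
stars_split = stars_text.str.split(" ").str[0].astype(int)

# a regular-expression option: capture the first run of digits
stars_regex = stars_text.str.extract(r"(\d+)", expand=False).astype(int)

# with numeric values, summaries such as the mean become available
print(stars_split.mean())
```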