# Assignment: Data Wrangling
## `! git clone https://github.com/DS3001/wrangling`
## Do Q2, and one of Q1 or Q3.

**Q1.** Open the "tidy_data.pdf" document in the repo, which is a paper called Tidy Data by Hadley Wickham.

  1. Read the abstract. What is this paper about?
  2. Read the introduction. What is the "tidy data standard" intended to accomplish?
  3. Read the intro to section 2. What does this sentence mean: "Like families, tidy datasets are all alike but every messy dataset is messy in its own way." What does this sentence mean: "For a given dataset, it’s usually easy to figure out what are observations and what are variables, but it is surprisingly difficult to precisely define variables and observations in general."
  4. Read Section 2.2. How does Wickham define values, variables, and observations?
  5. How is "Tidy Data" defined in section 2.3?
  6. Read the intro to Section 3 and Section 3.1. What are the 5 most common problems with messy datasets? Why are the data in Table 4 messy? What is "melting" a dataset?
  7. Why, specifically, is table 11 messy but table 12 tidy and "molten"?
  8. Read Section 6. What is the "chicken-and-egg" problem with focusing on tidy data? What does Wickham hope happens in the future with further work on the subject of data wrangling?

Q1.1: A large share of data analysis goes into cleaning data, yet Wickham observes that little effort has been put into understanding how to make data cleaning as easy and efficient as possible. He seeks to construct a framework that makes cleaning datasets easy and consistent, so that as much time as possible can be saved.

Q1.2: Wickham notes that roughly 80% of data analysis is spent on cleaning and preparing data, and he wants to reduce that waste. He introduces the concept of tidy data, a set of standards for structuring datasets so that they can be cleaned and explored without spending too much time or mental effort, and so that analysis tools can work well together. He then lays out the remaining sections and how each covers a particular aspect of data cleaning.

Q1.3: Tidy datasets all share the same structure, and the general ways in which data can be unclean are usually familiar: missing values, repeated instances, incorrect formatting. Even so, without a common structure or standard, every messy dataset remains a challenge to clean because each one has issues in its own unique way. The second sentence points out that, for any particular dataset, context makes it easy to tell what the observations are and what the variables are. Outside that context, there is no real limit to what can count as a variable or an observation, so cleaning data effectively and safely, without harming the end result, requires structure.

Q1.4: Values are usually numbers (if quantitative) or strings (if qualitative), and every value belongs to both a variable and an observation. A variable contains all the values that measure the same underlying attribute across units. An observation contains all the values measured on the same unit across attributes.

Q1.5: Each variable forms a column, each observation forms a row, and each type of observational unit forms a table.

Q1.6: The five most common problems are: column headers that are values rather than variable names; multiple variables stored in one column; variables stored in both rows and columns; multiple types of observational units stored in the same table; and a single observational unit stored in multiple tables. Table 4 is messy because its column headers are income ranges, which are values of an income variable rather than variable names. The fix is to remake the table with 'religion', 'income', and 'frequency' as columns, which turns the column headers into values of a proper variable. This process is called 'melting'.
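The melting step described above can be sketched in pandas. This is a minimal illustration on a made-up table shaped like Table 4 (the counts and the two income columns are invented for the example, not copied from the paper), using `DataFrame.melt`:

```python
import pandas as pd

# Toy stand-in for Table 4; the counts are made up for illustration.
wide = pd.DataFrame({
    "religion": ["Agnostic", "Atheist"],
    "<$10k": [1, 2],
    "$10-20k": [3, 4],
})

# Melt: keep 'religion' as an identifier and stack the income columns
# into (income, frequency) pairs, one row per religion-income combination.
tidy = wide.melt(id_vars="religion", var_name="income", value_name="frequency")
print(tidy)
```

Each row of the result is one religion-income combination, so the income ranges become values of an `income` column instead of column headers.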

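Q1.7: Table 11 is messy because variables are stored in both rows and columns: the days of the month are spread across columns, while the 'element' column holds the names of variables (tmin and tmax). Table 12(a) fixes the first problem by melting the day columns into a single 'date' column, which gives the molten form. Table 12(b) then separates tmin and tmax into their own columns, so each row is one day's observation and each column is a variable.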
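The two-step fix for a table like Table 11 (melt the day columns, then spread the element names into columns) might look like the following in pandas. This is a sketch on invented data: the station id, the `d1`/`d2` columns, and the temperatures are placeholders, not values from the paper.

```python
import pandas as pd

# Toy stand-in for Table 11; values are made up for illustration.
raw = pd.DataFrame({
    "id": ["S1", "S1"],
    "element": ["tmax", "tmin"],
    "d1": [30.0, 15.0],
    "d2": [31.0, 16.0],
})

# Step 1 (molten): stack the day columns into a single 'day' variable.
molten = raw.melt(id_vars=["id", "element"], var_name="day", value_name="value")

# Step 2 (tidy): spread the 'element' values out into their own columns.
tidy = (
    molten.pivot(index=["id", "day"], columns="element", values="value")
    .reset_index()
)
tidy.columns.name = None  # drop the leftover 'element' label on the column index
print(tidy)
```

After step 1 the data are molten but `element` still mixes two variables in one column; step 2 gives one row per station-day with `tmax` and `tmin` as separate columns.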

Q1.8: The chicken-and-egg problem is that tidy data is only as useful as the tools that work with it, so Wickham's tidy data standard is inherently coupled to those tools: data must conform to the standard for the tools to work, and changing either the data structures or the tools on their own does not improve the workflow. His long-term hope is that others will build on this framework and find new methodologies for storing, preparing, and cleaning data efficiently, so that one is not reliant on the particular tools he uses and promotes.


**Q2.** This question provides some practice cleaning variables which have common problems.
1. Numeric variable: For `./data/airbnb_hw.csv`, clean the `Price` variable as well as you can, and explain the choices you make. How many missing values do you end up with? (Hint: What happens to the formatting when a price goes over 999 dollars, say from 675 to 1,112?)
2. Categorical variable: For the `./data/sharks.csv` data covered in the lecture, clean the "Type" variable as well as you can, and explain the choices you make.
3. Dummy variable: For the pretrial data covered in the lecture, clean the `WhetherDefendantWasReleasedPretrial` variable as well as you can, and, in particular, replace missing values with `np.nan`.
4. Missing values, not at random: For the pretrial data covered in the lecture, clean the `ImposedSentenceAllChargeInContactEvent` variable as well as you can, and explain the choices you make. (Hint: Look at the `SentenceTypeAllChargesAtConvictionInContactEvent` variable.)

In [14]:
import pandas as pd

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

from IPython.display import display, HTML
display(HTML("<style>.jp-OutputArea-output {display:flex}</style>"))

data = pd.read_csv("data/airbnb_hw.csv")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30478 entries, 0 to 30477
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Host Id                     30478 non-null  int64  
 1   Host Since                  30475 non-null  object 
 2   Name                        30478 non-null  object 
 3   Neighbourhood               30478 non-null  object 
 4   Property Type               30475 non-null  object 
 5   Review Scores Rating (bin)  22155 non-null  float64
 6   Room Type                   30478 non-null  object 
 7   Zipcode                     30344 non-null  float64
 8   Beds                        30393 non-null  float64
 9   Number of Records           30478 non-null  int64  
 10  Number Of Reviews           30478 non-null  int64  
 11  Price                       30478 non-null  object 
 12  Review Scores Rating        22155 non-null  float64
dtypes: float64(4), int64(3), object(6)

In [19]:
data.value_counts('Price').head(25)

Price
150    1481
100    1207
200    1059
125     889
75      873
80      798
250     747
120     743
90      729
70      711
175     705
65      696
60      683
50      643
85      623
95      558
99      558
110     541
130     457
140     457
160     449
55      437
180     399
300     397
225     384
Name: count, dtype: int64

In [22]:
data['Price'] = pd.to_numeric(data['Price'], errors='coerce')

In [23]:
data.value_counts('Price').head(25)

Price
150.0    1481
100.0    1207
200.0    1059
125.0     889
75.0      873
80.0      798
250.0     747
120.0     743
90.0      729
70.0      711
175.0     705
65.0      696
60.0      683
50.0      643
85.0      623
95.0      558
99.0      558
110.0     541
130.0     457
140.0     457
160.0     449
55.0      437
180.0     399
300.0     397
225.0     384
Name: count, dtype: int64

In [24]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30478 entries, 0 to 30477
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Host Id                     30478 non-null  int64  
 1   Host Since                  30475 non-null  object 
 2   Name                        30478 non-null  object 
 3   Neighbourhood               30478 non-null  object 
 4   Property Type               30475 non-null  object 
 5   Review Scores Rating (bin)  22155 non-null  float64
 6   Room Type                   30478 non-null  object 
 7   Zipcode                     30344 non-null  float64
 8   Beds                        30393 non-null  float64
 9   Number of Records           30478 non-null  int64  
 10  Number Of Reviews           30478 non-null  int64  
 11  Price                       30297 non-null  float64
 12  Review Scores Rating        22155 non-null  float64
dtypes: float64(5), int64(3), object(5)

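Q2.1: We coerce `Price` to numeric (float64) with `pd.to_numeric(..., errors='coerce')`, which leaves 181 missing values (30478 - 30297). `Price` had no missing values before the conversion, so these are strings that failed to parse. Given the hint, they are likely prices over 999 written with a comma (for example "1,112"), and recovering them would require removing the comma before converting.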
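The hint in the question points at prices over 999 being formatted with a comma as a thousands separator. A minimal sketch of stripping that separator before conversion, shown on made-up strings rather than the real file (the sample values and the names `price` and `cleaned` are assumptions for illustration):

```python
import pandas as pd

# Toy prices mimicking the raw strings; not taken from the real file.
price = pd.Series(["675", "1,112", "99", "abc"])

# Remove the thousands separator before converting, so "1,112" parses as 1112.
cleaned = pd.to_numeric(price.str.replace(",", "", regex=False), errors="coerce")

print(cleaned.tolist())
print("missing:", cleaned.isna().sum())
```

With this ordering, only genuinely non-numeric strings become missing. On the real data the same pattern would be applied to `data['Price']`, and the missing count should be rechecked afterwards.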

In [6]:
import pandas as pd
import numpy as np

sd = pd.read_csv("data/sharks.csv")
sd.value_counts('Type')


Type
Unprovoked             4716
Provoked                593
Invalid                 552
Sea Disaster            239
Watercraft              142
Boat                    109
Boating                  92
Questionable             10
Boatomg                   1
Unconfirmed               1
Under investigation       1
Unverified                1
Name: count, dtype: int64

In [11]:
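tp = sd['Type']

# Fix the 'Boatomg' typo and merge all boat-related labels into one category
tp = tp.replace(['Boatomg', 'Watercraft', 'Sea Disaster', 'Boat', 'Boating'], 'Boat Activity')
# Treat unresolved or unverifiable cases as missing
tp = tp.replace(['Invalid', 'Questionable', 'Unconfirmed', 'Unverified', 'Under investigation'], np.nan)
sd['Type'] = tp

sd['Type'].value_counts()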

Type
Unprovoked       4716
Provoked          593
Boat Activity     583
Name: count, dtype: int64

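Q2.2: The existing variable is extremely messy and has a number of labels that do not need to exist and can instead be combined for clarity. We first take everything that is not classified as provoked or unprovoked but does have a topic ('Boatomg', 'Watercraft', 'Sea Disaster', 'Boat', 'Boating') and simply refer to it as 'Boat Activity'. Anything which is labeled 'Invalid', 'Questionable', 'Unconfirmed', 'Unverified', or 'Under investigation' is replaced with `np.nan`, since those labels do not describe a type of incident.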

**Q3.** Many important datasets contain a race variable, typically limited to a handful of values often including Black, White, Asian, Latino, and Indigenous. This question looks at data gathering efforts on this variable by the U.S. Federal government.

1. How did the most recent US Census gather data on race?
2. Why do we gather these data? What role do these kinds of data play in politics and society? Why does data quality matter?
3. Please provide a constructive criticism of how the Census was conducted: What was done well? What do you think was missing? How should future large scale surveys be adjusted to best reflect the diversity of the population? Could some of the Census' good practices be adopted more widely to gather richer and more useful data?
4. How did the Census gather data on sex and gender? Please provide a similar constructive criticism of their practices.
5. When it comes to cleaning data, what concerns do you have about protected characteristics like sex, gender, sexual identity, or race? What challenges can you imagine arising when there are missing values? What good or bad practices might people adopt, and why?
6. Suppose someone invented an algorithm to impute values for protected characteristics like race, gender, sex, or sexuality. What kinds of concerns would you have?