# Assignment: Data Wrangling
## `! git clone https://github.com/DS3001/wrangling`
## Do Q2, and one of Q1 or Q3.

**Q1.** Open the "tidy_data.pdf" document in the repo, which is a paper called Tidy Data by Hadley Wickham.

  1. Read the abstract. What is this paper about?
  2. Read the introduction. What is the "tidy data standard" intended to accomplish?
  3. Read the intro to section 2. What does this sentence mean: "Like families, tidy datasets are all alike but every messy dataset is messy in its own way." What does this sentence mean: "For a given dataset, it’s usually easy to figure out what are observations and what are variables, but it is surprisingly difficult to precisely define variables and observations in general."
  4. Read Section 2.2. How does Wickham define values, variables, and observations?
  5. How is "Tidy Data" defined in section 2.3?
  6. Read the intro to Section 3 and Section 3.1. What are the 5 most common problems with messy datasets? Why are the data in Table 4 messy? What is "melting" a dataset?
  7. Why, specifically, is table 11 messy but table 12 tidy and "molten"?
  8. Read Section 6. What is the "chicken-and-egg" problem with focusing on tidy data? What does Wickham hope happens in the future with further work on the subject of data wrangling?

**Q2.** This question provides some practice cleaning variables which have common problems.
1. Numeric variable: For `./data/airbnb_hw.csv`, clean the `Price` variable as well as you can, and explain the choices you make. How many missing values do you end up with? (Hint: What happens to the formatting when a price goes over 999 dollars, say from 675 to 1,112?)
2. Categorical variable: For the `./data/sharks.csv` data covered in the lecture, clean the "Type" variable as well as you can, and explain the choices you make.
3. Dummy variable: For the pretrial data covered in the lecture, clean the `WhetherDefendantWasReleasedPretrial` variable as well as you can, and, in particular, replace missing values with `np.nan`.
4. Missing values, not at random: For the pretrial data covered in the lecture, clean the `ImposedSentenceAllChargeInContactEvent` variable as well as you can, and explain the choices you make. (Hint: Look at the `SentenceTypeAllChargesAtConvictionInContactEvent` variable.)

In [5]:
import pandas as pd
import numpy as np
df = pd.read_csv('airbnb_hw.csv')
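price = df['Price']
print(price.unique())  # values are strings; prices of 1,000 or more contain a comma
price = price.str.replace(',', '')  # drop the thousands separator
price = price.astype(int)  # raises if any value is missing or non-numeric
df['Price'] = price
print(price.unique())
print("Missing values: ", price.isnull().sum())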

['145' '37' '28' '199' '549' '149' '250' '90' '270' '290' '170' '59' '49'
 '68' '285' '75' '100' '150' '700' '125' '175' '40' '89' '95' '99' '499'
 '120' '79' '110' '180' '143' '230' '350' '135' '85' '60' '70' '55' '44'
 '200' '165' '115' '74' '84' '129' '50' '185' '80' '190' '140' '45' '65'
 '225' '600' '109' '1,990' '73' '240' '72' '105' '155' '160' '42' '132'
 '117' '295' '280' '159' '107' '69' '239' '220' '399' '130' '375' '585'
 '275' '139' '260' '35' '133' '300' '289' '179' '98' '195' '29' '27' '39'
 '249' '192' '142' '169' '1,000' '131' '138' '113' '122' '329' '101' '475'
 '238' '272' '308' '126' '235' '315' '248' '128' '56' '207' '450' '215'
 '210' '385' '445' '136' '247' '118' '77' '76' '92' '198' '205' '299'
 '222' '245' '104' '153' '349' '114' '320' '292' '226' '420' '500' '325'
 '307' '78' '265' '108' '123' '189' '32' '58' '86' '219' '800' '335' '63'
 '229' '425' '67' '87' '1,200' '158' '650' '234' '310' '695' '400' '166'
 '119' '62' '168' '340' '479' '43' '395' '144' '52' 

**On inspection, the `Price` values are stored as strings, and prices of 1,000 or more contain a comma as a thousands separator. To clean the variable, I removed the commas and cast the column to `int`; there were no missing values.**
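As a cross-check on the missing-value count, here is a minimal sketch of a more defensive version of the same cleaning step. It assumes a small hypothetical sample rather than the real file, and the names `raw`, `clean_price`, and `n_missing` are illustrative. `pd.to_numeric` with `errors='coerce'` turns anything unparseable into NaN instead of raising, so missing values can be counted rather than ruled out by the absence of an error.

```python
import pandas as pd

# Hypothetical sample mimicking the raw Price strings; not the real airbnb_hw.csv.
raw = pd.Series(['145', '1,990', '1,000', None, 'n/a'])

# Strip the thousands separator, then coerce: unparseable entries become NaN.
clean_price = pd.to_numeric(raw.str.replace(',', '', regex=False), errors='coerce')

n_missing = int(clean_price.isna().sum())
print(clean_price.tolist())
print("Missing values:", n_missing)
```

On the real data this should agree with the `astype(int)` approach whenever every entry is a well-formed number.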

In [13]:
df = pd.read_csv('sharks.csv', low_memory = False)
print(df['Type'].value_counts())
temp = df['Type']
temp = temp.replace(['Invalid', 'Questionable','Unconfirmed','Unverified','Under investigation'], np.nan)
temp = temp.replace(['Boat', 'Boating', 'Boatomg', 'Watercraft'], 'Watercraft') # had to add watercraft as seen in solution
df['Type'] = temp
df['Type'].value_counts()

Unprovoked             4716
Provoked                593
Invalid                 552
Sea Disaster            239
Watercraft              142
Boat                    109
Boating                  92
Questionable             10
Unconfirmed               1
Unverified                1
Under investigation       1
Boatomg                   1
Name: Type, dtype: int64


Unprovoked      4716
Provoked         593
Watercraft       344
Sea Disaster     239
Name: Type, dtype: int64

**To clean the `Type` variable, I recoded every label that signals an unverified report ('Invalid', 'Questionable', 'Unconfirmed', 'Unverified', 'Under investigation') as NaN, and merged the boating labels, including the misspelling 'Boatomg', into 'Watercraft'.**
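The same recoding can also be written as a single mapping, which keeps every decision in one place. This is only a sketch on a hypothetical sample of labels; `raw_type`, `recode`, and `clean_type` are illustrative names, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of raw Type labels; not the real sharks.csv.
raw_type = pd.Series(['Unprovoked', 'Boat', 'Boatomg', 'Invalid',
                      'Questionable', 'Sea Disaster', 'Boating'])

# One dictionary documents every recoding decision.
recode = {
    'Boat': 'Watercraft', 'Boating': 'Watercraft', 'Boatomg': 'Watercraft',
    'Invalid': np.nan, 'Questionable': np.nan, 'Unconfirmed': np.nan,
    'Unverified': np.nan, 'Under investigation': np.nan,
}
clean_type = raw_type.replace(recode)

# dropna=False keeps the NaN group visible in the tally.
print(clean_type.value_counts(dropna=False))
```

Passing `dropna=False` to `value_counts` is a deliberate choice here: it shows how many rows were set to missing, which the default tally hides.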

In [21]:
df = pd.read_parquet('justice_data.parquet')
temp = df['WhetherDefendantWasReleasedPretrial']
print(temp.unique())
print("Invalid values: ", len(df[df['WhetherDefendantWasReleasedPretrial'] == 9]))
# print(df.head())
temp = temp.replace(9, np.nan)
df['WhetherDefendantWasReleasedPretrial'] = temp
print("Missing values: ", df['WhetherDefendantWasReleasedPretrial'].isnull().sum())

[9 0 1]
Invalid values:  31
Missing values:  31
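
An alternative to replacing the code 9 by name is to keep only the valid dummy codes, so that any other unexpected code would also become missing. The sketch below uses a hypothetical sample, and `released` and `clean_released` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; 9 is the code treated as missing in the cell above.
released = pd.Series([9, 0, 1, 1, 9, 0])

# Keep values that are valid dummy codes (0 or 1); everything else becomes NaN.
clean_released = released.where(released.isin([0, 1]), np.nan)

print(clean_released.value_counts(dropna=False))
print("Missing values:", int(clean_released.isna().sum()))
```

Because NaN is a float, the cleaned column is stored as floats rather than integers.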


In [32]:
df['ImposedSentenceAllChargeInContactEvent'].value_counts()
impose = df['ImposedSentenceAllChargeInContactEvent']
types = df['SentenceTypeAllChargesAtConvictionInContactEvent']
impose = pd.to_numeric(impose,errors='coerce') # could not use astype(int) here
print(impose.value_counts())
null_vals = impose.isnull()
print(null_vals.sum())
# impose = impose.replace(np.nan, 0)
print(impose.isnull().sum())
print(pd.crosstab(null_vals, types), '\n') # "Category 4 is cases where the charges were dismissed"

impose = impose.mask(types == 4, 0) # dismissed cases have a sentence length of 0
impose = impose.mask(types == 9, np.nan) # not meaningful
null_vals = impose.isnull()
print(null_vals.sum())

df['ImposedSentenceAllChargeInContactEvent'] = impose

0.000000     4953
12.000000    1404
0.985626     1051
6.000000      809
3.000000      787
             ... 
49.971253       1
57.034908       1
79.926078       1
42.164271       1
1.657084        1
Name: ImposedSentenceAllChargeInContactEvent, Length: 483, dtype: int64
9053
9053
SentenceTypeAllChargesAtConvictionInContactEvent     0     1    2     4    9
ImposedSentenceAllChargeInContactEvent                                      
False                                             8720  4299  914     0    0
True                                                 0     0    0  8779  274 

274


**Cleaning the imposed sentence variable was notably harder than cleaning the others because it required understanding what the codes mean. First, the values were coerced to a numeric type with `pd.to_numeric`, which turned the non-numeric entries into NaN. The crosstab shows that every NaN falls under sentence type 4 or 9. Type 4 marks cases where the charges were dismissed, so those were given a sentence length of 0. Type 9 is not a meaningful category, so those 274 values were left as NaN.**
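
The reasoning above can be reproduced on a toy table. This is a sketch under stated assumptions: the column names `sentence` and `sentence_type` are hypothetical, and the blank strings stand in for whatever non-numeric entries the real column contains.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; blank strings stand in for unrecorded sentences.
toy = pd.DataFrame({
    'sentence': ['12', ' ', '0.98', ' ', ' '],
    'sentence_type': [0, 4, 1, 9, 4],
})

# Coerce to numeric: the blank strings become NaN.
length = pd.to_numeric(toy['sentence'], errors='coerce')

# Cross-tabulate missingness against sentence type to see where the NaNs sit.
print(pd.crosstab(length.isna(), toy['sentence_type']))

length = length.mask(toy['sentence_type'] == 4, 0)       # dismissed: length 0
length = length.mask(toy['sentence_type'] == 9, np.nan)  # type 9: keep missing
print(length.tolist())
```

The crosstab is what justifies the recoding: the missing values are filled with 0 only because they line up with one sentence type, not because NaN means zero in general.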

**Q3.** Many important datasets contain a race variable, typically limited to a handful of values often including Black, White, Asian, Latino, and Indigenous. This question looks at data gathering efforts on this variable by the U.S. Federal government.

1. How did the most recent US Census gather data on race?
2. Why do we gather these data? What role do these kinds of data play in politics and society? Why does data quality matter?
3. Please provide a constructive criticism of how the Census was conducted: What was done well? What do you think was missing? How should future large scale surveys be adjusted to best reflect the diversity of the population? Could some of the Census' good practices be adopted more widely to gather richer and more useful data?
4. How did the Census gather data on sex and gender? Please provide a similar constructive criticism of their practices.
5. When it comes to cleaning data, what concerns do you have about protected characteristics like sex, gender, sexual identity, or race? What challenges can you imagine arising when there are missing values? What good or bad practices might people adopt, and why?
6. Suppose someone invented an algorithm to impute values for protected characteristics like race, gender, sex, or sexuality. What kinds of concerns would you have?

1. The most recent US Census (2020) gathered data on race through a "select all that apply" question, with Hispanic/Latino origin asked as its own separate question and space provided to specify ethnicity.
2. These data are gathered to gain a better understanding of the US population, and they play into politics and society by providing the information the government needs to distribute funds to states and their localities. Data quality matters for drawing meaningful and accurate conclusions and for fulfilling the original intention of learning how to better fund communities.
3. The Census does a good job of assessing living situations in terms of the number of residents and the general size of the home. It asks for general information on identity without being too specific, but future large-scale surveys could include a space to describe gender identity to better reflect the diversity of the population. However, because there is so much room for improvement with the Census, I'm not entirely sure its practices could be adopted to gather richer and more useful data. Ensuring there is no representation bias when collecting data is likely the best way to maintain its quality and accuracy.
4. The Census gathered data on sex with two options: male and female. I personally do not think it is inclusive of those who may not identify as the sex they were assigned at birth, especially because there is no space to assign gender.  
5. When it comes to cleaning data, my concern about protected characteristics like sex, gender, sexual identity, and race is that, as mentioned before, the way people identify themselves is extremely personal and cannot be consolidated or "cleaned" like some of the variables above (e.g., the shark data). Removing missing values would diminish the full representation of the population, and adjusting missing values would skew the data and add bias. These are just a few practices that would do more harm than good, and it is important to have high-quality data, especially when representing real people.
6. If someone invented an algorithm to impute protected characteristics, it could be extremely inaccurate, and it would be unethical to assume anything about anyone's life. There is no definite way to impute protected characteristics, so algorithmic bias in how the data were imputed could produce invalid results.