
# Assignment: Data Wrangling
## `! git clone https://github.com/DS3001/wrangling`
## Do Q2, and one of Q1 or Q3.

In [1]:
! git clone https://github.com/DS3001/wrangling


Cloning into 'wrangling'...
remote: Enumerating objects: 92, done.
remote: Counting objects: 100% (52/52), done.
remote: Compressing objects: 100% (34/34), done.
remote: Total 92 (delta 42), reused 18 (delta 18), pack-reused 40 (from 1)
Receiving objects: 100% (92/92), 18.08 MiB | 5.36 MiB/s, done.
Resolving deltas: 100% (43/43), done.
Updating files: 100% (21/21), done.


**Q1.** Open the "tidy_data.pdf" document in the repo, which is a paper called Tidy Data by Hadley Wickham.

  1. Read the abstract. What is this paper about?
  2. Read the introduction. What is the "tidy data standard" intended to accomplish?
  3. Read the intro to section 2. What does this sentence mean: "Like families, tidy datasets are all alike but every messy dataset is messy in its own way." What does this sentence mean: "For a given dataset, it’s usually easy to figure out what are observations and what are variables, but it is surprisingly difficult to precisely define variables and observations in general."
  4. Read Section 2.2. How does Wickham define values, variables, and observations?
  5. How is "Tidy Data" defined in section 2.3?
  6. Read the intro to Section 3 and Section 3.1. What are the 5 most common problems with messy datasets? Why are the data in Table 4 messy? What is "melting" a dataset?
  7. Why, specifically, is table 11 messy but table 12 tidy and "molten"?
  8. Read Section 6. What is the "chicken-and-egg" problem with focusing on tidy data? What does Wickham hope happens in the future with further work on the subject of data wrangling?

1. Wickham emphasizes the importance of thinking more abstractly about data cleaning, a process that often doesn't get the attention it deserves. While removing NAs is useful, there's more to consider when determining if data is fully cleaned. He introduces a framework for clean data: each row should represent an observation, each column should correspond to a variable, and each table should reflect a distinct type of observational unit. He then explores the implications of this approach for effective data cleaning.

2. The "tidy data standard" is intended to standardize the data cleaning process. Its goal is to simplify data cleaning by ensuring that everyone involved understands the end goal and the typical steps needed to achieve it.

3. The first sentence means that tidy datasets all follow the same consistent structure, while each messy dataset has its own specific problems, which makes messy data harder to clean. This is a familiar issue: data are often organized around the creator's needs or convenience, so it takes significant effort before they can be manipulated effectively in R. The sentence also plays off the opening line of Tolstoy's *Anna Karenina*.

  The second sentence points out that while the concept of a "data frame" or "matrix" is intuitive — with rows as observations and columns as variables — defining observations and variables can be tricky in practice. This choice often depends on how the data will be analyzed. For example, in a dataset containing information for counties by year, an observation could be a county-year rather than just a county, which isn't always immediately obvious. Misunderstanding this can lead to poor decisions about data cleaning and organization.

4. Wickham defines values, variables, and observations as follows: A dataset consists of values, which can be numeric or categorical (e.g., strings). Each value is associated with both a variable and an observation. A variable is a set of values that represent the same attribute or property (such as height, color, or temperature). An observation is a set of values that describes a specific instance or case being measured.

5. In tidy data, each variable is a column, each observation is a row, and each type of observational unit is a table. If data is not tidy, it is messy.

6. The five most common problems with messy datasets are:

   1. Column headers are values, not variable names: For instance, years like "2012" are used as column headers instead of being values of a variable called "Year."
   2. Multiple variables are stored in one column: A single column might contain several variables, like a date that combines month, day, and year.
   3. Variables are stored in both rows and columns: Data may be split across both rows and columns, making it difficult to treat them as distinct variables.
   4. Multiple types of observational units are stored in the same table: Different types of data, such as individuals and firms, are combined in one dataset instead of being separated.
   5. A single observational unit is spread across multiple tables: One observation is unnecessarily split across multiple tables, requiring extra work to consolidate the data.

   In Table 4, the data are messy because the column headers are values of a hidden variable (income). Income should be treated as a variable, with its values in a separate column alongside the other variables, religion and frequency. This restructuring allows every variable to be named properly.

   "Melting" a dataset means turning columns into rows: the column headers that are really values become entries in a single new variable column, and the cell contents move into a matching value column. This converts wide-format data into long-format data.

7. Table 11 is messy because the days are listed as column headers, which are values, not variable names. In Table 12(a), those days are "melted" into a single variable called "date," but the result is still not fully tidy because the "element" column contains variable names (like tmax and tmin), which represent measurements, not values. Table 12(b) is tidy because tmax and tmin each get their own column, so no variable names are stored as values.

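8. The "chicken-and-egg" problem is that tidy data is only as useful as the tools that work with it, while tidy tools are in turn tied to tidy data, so changing either one on its own may not improve the workflow. Wickham aims for a broader philosophy of data cleaning beyond simply promoting specific tools. If the tidy data framework is only used to support certain tools, it becomes more about marketing. He hopes the tidy concept evolves into a comprehensive ecosystem of ideas and tools that benefits data science as a whole, not just making it easier to use tools like ggplot2.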
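The reshaping steps described in answers 6 and 7 can be sketched in pandas. This is only a minimal illustration, not the paper's own code (the paper works in R): the two small data frames are toy tables in the spirit of Tables 4 and 12(a), with illustrative values, and the names `wide`, `molten`, `weather`, and `tidy` are chosen here for the example.

```python
import pandas as pd

# Toy wide table in the spirit of Table 4: income brackets appear as column headers.
wide = pd.DataFrame({
    "religion": ["Agnostic", "Atheist"],
    "<$10k": [27, 12],
    "$10-20k": [34, 27],
})

# Melt: the bracket headers become values of a new "income" variable,
# and the cell contents move into a "freq" column.
molten = wide.melt(id_vars="religion", var_name="income", value_name="freq")

# Toy molten table in the spirit of Table 12(a): "element" holds variable names.
weather = pd.DataFrame({
    "id": ["MX17004"] * 4,
    "date": ["2010-01-30", "2010-01-30", "2010-02-02", "2010-02-02"],
    "element": ["tmax", "tmin", "tmax", "tmin"],
    "value": [27.8, 14.5, 27.3, 14.4],
})

# Pivot: each name in "element" becomes its own column, one row per station-day.
tidy = (
    weather.pivot(index=["id", "date"], columns="element", values="value")
    .reset_index()
)
tidy.columns.name = None  # drop the leftover "element" label on the column index

print(molten)
print(tidy)
```

After the melt, each row is one religion-income combination; after the pivot, each row is one station-day with `tmax` and `tmin` as separate variables.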

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns

**Q2.** This question provides some practice cleaning variables which have common problems.
1. Numeric variable: For `./data/airbnb_hw.csv`, clean the `Price` variable as well as you can, and explain the choices you make. How many missing values do you end up with? (Hint: What happens to the formatting when a price goes over 999 dollars, say from 675 to 1,112?)
2. Categorical variable: For the `./data/sharks.csv` data covered in the lecture, clean the "Type" variable as well as you can, and explain the choices you make.
3. Dummy variable: For the pretrial data covered in the lecture, clean the `WhetherDefendantWasReleasedPretrial` variable as well as you can, and, in particular, replace missing values with `np.nan`.
4. Missing values, not at random: For the pretrial data covered in the lecture, clean the `ImposedSentenceAllChargeInContactEvent` variable as well as you can, and explain the choices you make. (Hint: Look at the `SentenceTypeAllChargesAtConvictionInContactEvent` variable.)

In [4]:
import os
print(os.getcwd())

/content


In [3]:
df = pd.read_csv('./wrangling/data/airbnb_hw.csv', low_memory=False) # The repo was cloned into ./wrangling, so the data live under that folder
print( df.shape, '\n')
df.head()
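One possible way to approach the `Price` cleaning in part 1 is sketched below. It follows the hint in the question (prices above 999 are written with a thousands separator, as in `1,112`), but the sample values are made up for illustration, and the dollar sign and the `n/a` entry are assumptions about what the real column might contain rather than observed facts.

```python
import pandas as pd

# Hypothetical sample of raw Price strings; the real column may use other formats.
raw_price = pd.Series(["675", "1,112", "$85", " 2,500 ", "n/a"])

cleaned = (
    raw_price.astype(str)
    .str.replace(",", "", regex=False)   # remove thousands separators
    .str.replace("$", "", regex=False)   # remove currency symbols, if any
    .str.strip()                         # remove stray whitespace
)

# Anything that still cannot be parsed as a number becomes NaN.
price = pd.to_numeric(cleaned, errors="coerce")
n_missing = int(price.isna().sum())

print(price.tolist())
print("missing:", n_missing)
```

Applied to the real data, the same steps would be run on `df['Price']`, and the count of missing values afterwards answers the second half of the question.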

2. Categorical variable:

In [6]:
df = pd.read_csv('./wrangling/data/sharks.csv', low_memory=False)
# df.head()
# df.columns.tolist()

In [7]:
df['Type'].value_counts()

| Type | count |
|---|---|
| Unprovoked | 4716 |
| Provoked | 593 |
| Invalid | 552 |
| Sea Disaster | 239 |
| Watercraft | 142 |
| Boat | 109 |
| Boating | 92 |
| Questionable | 10 |
| Unconfirmed | 1 |
| Unverified | 1 |


In [8]:
shark_type = df['Type'] # Temporary copy of the Type variable; named shark_type to avoid shadowing the built-in type()

shark_type = shark_type.replace(['Sea Disaster', 'Boat','Boating','Boatomg'],'Watercraft') # All watercraft/boating values
shark_type = shark_type.replace(['Invalid', 'Questionable','Unconfirmed','Unverified','Under investigation'],np.nan) # Unverifiable or invalid reports are treated as missing

df['Type'] = shark_type # Replace the 'Type' variable with the cleaned version
del shark_type # Destroy the temporary vector

df['Type'].value_counts()

| Type | count |
|---|---|
| Unprovoked | 4716 |
| Provoked | 593 |
| Watercraft | 583 |
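The same recoding can also be written with a single dictionary passed to `Series.replace`, which keeps every decision in one place. This is an alternative sketch on a small made-up series, not a replacement for the cell above; the names `raw_type`, `recode`, and `clean_type` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Small made-up sample of Type values, using categories seen in the counts above.
raw_type = pd.Series(
    ["Unprovoked", "Boat", "Boating", "Sea Disaster", "Invalid", "Provoked", "Questionable"]
)

# One mapping: boating-related labels collapse to 'Watercraft',
# unverifiable labels become missing.
recode = {
    "Sea Disaster": "Watercraft",
    "Boat": "Watercraft",
    "Boating": "Watercraft",
    "Invalid": np.nan,
    "Questionable": np.nan,
}

clean_type = raw_type.replace(recode)
print(clean_type.value_counts(dropna=False))
```

Passing `dropna=False` to `value_counts` makes the number of values recoded to missing visible alongside the remaining categories.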


In [9]:
df['Fatal (Y/N)'] = df['Fatal (Y/N)'].replace(['UNKNOWN', 'F','M','2017'],np.nan) # Values that are not a valid Y/N answer become missing
df['Fatal (Y/N)'] = df['Fatal (Y/N)'].replace('y','Y') # Normalize lowercase 'y' to 'Y'
pd.crosstab(df['Type'],df['Fatal (Y/N)'],normalize='index')

| Type | Fatal (Y/N) = N | Fatal (Y/N) = Y |
|---|---|---|
| Provoked | 0.967521 | 0.032479 |
| Unprovoked | 0.743871 | 0.256129 |
| Watercraft | 0.684303 | 0.315697 |


3. Dummy variable:

In [13]:
df = pd.read_csv('./wrangling/pretrial_data.csv', low_memory=False)
# df.head()
# df.columns.tolist()

In [14]:
release = df['WhetherDefendantWasReleasedPretrial']
print(release.unique(),'\n')
print(release.value_counts(),'\n')
release = release.replace(9,np.nan) # In the codebook, the 9's are "unclear"
print(release.value_counts(),'\n')
sum(release.isnull()) # 31 missing values
df['WhetherDefendantWasReleasedPretrial'] = release # Replace data column with cleaned values
del release

KeyError: 'WhetherDefendantWasReleasedPretrial'
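For part 4 of Q2 (missing values, not at random), the hint suggests that whether a sentence length is missing depends on the sentence type. A minimal sketch of that idea on toy data follows. Everything specific here is an assumption for illustration: the short column names `sentence_type` and `imposed_sentence` stand in for the long names in the question, and the codes 4 ("charges dismissed") and 9 ("unknown") are hypothetical and would need to be checked against the codebook.

```python
import numpy as np
import pandas as pd

# Toy data; column names and the meaning of codes 4 and 9 are assumptions.
toy = pd.DataFrame({
    "sentence_type": [0, 1, 4, 4, 9, 2],
    "imposed_sentence": ["12", "36.5", " ", " ", " ", "60"],
})

# Blank strings cannot be parsed, so they become NaN.
length = pd.to_numeric(toy["imposed_sentence"], errors="coerce")

# Cross-tabulate missingness against sentence type to see where the NaNs fall.
print(pd.crosstab(length.isna(), toy["sentence_type"]))

# Assumed rule: dismissed charges (code 4) imply a sentence of 0,
# while unknown outcomes (code 9) stay missing.
length = length.mask(toy["sentence_type"] == 4, 0)
length = length.mask(toy["sentence_type"] == 9, np.nan)
toy["imposed_sentence_clean"] = length

print(toy)
```

The cross-tabulation is the key diagnostic: if the missing values cluster in particular sentence types, they are not missing at random and can often be filled in (or left missing) based on what that type means.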

**Q3.** Many important datasets contain a race variable, typically limited to a handful of values often including Black, White, Asian, Latino, and Indigenous. This question looks at data gathering efforts on this variable by the U.S. Federal government.

1. How did the most recent US Census gather data on race?
2. Why do we gather these data? What role do these kinds of data play in politics and society? Why does data quality matter?
3. Please provide a constructive criticism of how the Census was conducted: What was done well? What do you think was missing? How should future large scale surveys be adjusted to best reflect the diversity of the population? Could some of the Census' good practices be adopted more widely to gather richer and more useful data?
4. How did the Census gather data on sex and gender? Please provide a similar constructive criticism of their practices.
5. When it comes to cleaning data, what concerns do you have about protected characteristics like sex, gender, sexual identity, or race? What challenges can you imagine arising when there are missing values? What good or bad practices might people adopt, and why?
6. Suppose someone invented an algorithm to impute values for protected characteristics like race, gender, sex, or sexuality. What kinds of concerns would you have?