# HW 2: Wrangling

**Q1.** This question provides some practice cleaning variables which have common problems.
1. Numeric variable: For `./data/airbnb_hw.csv`, clean the `Price` variable as well as you can, and explain the choices you make. How many missing values do you end up with? (Hint: What happens to the formatting when a price goes over 999 dollars, say from 675 to 1,112?)

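The `Price` variable was initially stored as an object (string) column. I converted it to a numeric (float) type with `pd.to_numeric(..., errors="coerce")`, which sets any value that cannot be parsed to NaN. After the conversion, there are 181 missing values in the `Price` column. Because coercion also turns comma-formatted prices such as 1,112 into NaN, these 181 values likely include prices above 999 dollars rather than only truly missing entries.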

2. Categorical variable: For the Minnesota police use of force data, `./data/mn_police_use_of_force.csv`, clean the `subject_injury` variable, handling the NA's; this gives a value `Yes` when a person was injured by police, and `No` when no injury occurred. What proportion of the values are missing? Is this a concern? Cross-tabulate your cleaned `subject_injury` variable with the `force_type` variable. Are there any patterns regarding when the data are missing?

The `subject_injury` variable was cleaned by standardizing responses to Yes and No and labeling N/A values as Missing. After cleaning, about 76% of the values are missing, which is a serious concern because most incidents do not report whether an injury occurred. The cross-tabulation with `force_type` shows that missing values are not evenly distributed: categories like Bodily Force, Chemical Irritant, and Taser have large amounts of missing data, which suggests that injury reporting may be less complete for certain types of force.

3. Dummy variable: For the pretrial data covered in the lecture, clean the `WhetherDefendantWasReleasedPretrial` variable as well as you can, and, in particular, replace missing values with `np.nan`.
4. Missing values, not at random: For the pretrial data covered in the lecture, clean the `ImposedSentenceAllChargeInContactEvent` variable as well as you can, and explain the choices you make. (Hint: Look at the `SentenceTypeAllChargesAtConvictionInContactEvent` variable.)

In [3]:
#Q1.1
import pandas as pd

data = pd.read_csv("airbnb_hw.csv")
data.head()

Unnamed: 0,Host Id,Host Since,Name,Neighbourhood,Property Type,Review Scores Rating (bin),Room Type,Zipcode,Beds,Number of Records,Number Of Reviews,Price,Review Scores Rating
0,5162530,,1 Bedroom in Prime Williamsburg,Brooklyn,Apartment,,Entire home/apt,11249.0,1.0,1,0,145,
1,33134899,,"Sunny, Private room in Bushwick",Brooklyn,Apartment,,Private room,11206.0,1.0,1,1,37,
2,39608626,,Sunny Room in Harlem,Manhattan,Apartment,,Private room,10032.0,1.0,1,1,28,
3,500,6/26/2008,Gorgeous 1 BR with Private Balcony,Manhattan,Apartment,,Entire home/apt,10024.0,3.0,1,0,199,
4,500,6/26/2008,Trendy Times Square Loft,Manhattan,Apartment,95.0,Private room,10036.0,3.0,1,39,549,96.0


In [11]:
#Q1.1
data["Price"].dtype

dtype('O')

In [14]:
#Q1.1
data["Price"] = pd.to_numeric(data["Price"], errors="coerce")

In [15]:
#Q1.1
data["Price"].dtype

dtype('float64')

In [17]:
#Q1.1
data["Price"].isna().sum()

np.int64(181)
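One caveat with the conversion above: `pd.to_numeric(..., errors="coerce")` turns any string it cannot parse into NaN, and that includes comma-formatted prices such as `1,112`. The sketch below uses a small made-up Series (the values are illustrative, not taken from the Airbnb file) to show one way to strip the thousands separator first, so that only genuinely unparseable entries end up missing.

```python
import pandas as pd

# Toy stand-in for the Price column; these values are illustrative only.
raw_price = pd.Series(["145", "675", "1,112", "2,500", None])

# Coercing directly: comma-formatted prices become NaN.
naive = pd.to_numeric(raw_price, errors="coerce")

# Removing the thousands separator first keeps those prices.
cleaned = pd.to_numeric(raw_price.str.replace(",", "", regex=False), errors="coerce")

naive_missing = int(naive.isna().sum())      # the None plus both comma values
cleaned_missing = int(cleaned.isna().sum())  # only the None
print(naive_missing, cleaned_missing)
```

Applying the same two steps to `data["Price"]` before counting would show how many of the missing values are really missing.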

In [18]:
#Q1.2
police = pd.read_csv("mn_police_use_of_force.csv")
police.head()

Unnamed: 0,response_datetime,problem,is_911_call,primary_offense,subject_injury,force_type,force_type_action,race,sex,age,type_resistance,precinct,neighborhood
0,2016/01/01 00:47:36,Assault in Progress,Yes,DASLT1,,Bodily Force,Body Weight to Pin,Black,Male,20.0,Tensed,1,Downtown East
1,2016/01/01 02:19:34,Fight,No,DISCON,,Chemical Irritant,Personal Mace,Black,Female,27.0,Verbal Non-Compliance,1,Downtown West
2,2016/01/01 02:19:34,Fight,No,DISCON,,Chemical Irritant,Personal Mace,White,Female,23.0,Verbal Non-Compliance,1,Downtown West
3,2016/01/01 02:28:48,Fight,No,PRIORI,,Chemical Irritant,Crowd Control Mace,Black,Male,20.0,Commission of Crime,1,Downtown West
4,2016/01/01 02:28:48,Fight,No,PRIORI,,Chemical Irritant,Crowd Control Mace,Black,Male,20.0,Commission of Crime,1,Downtown West


In [21]:
#Q1.2
police["subject_injury"].value_counts(dropna=False)

subject_injury,count
NaN,9848
Yes,1631
No,1446


In [25]:
#Q1.2
missing_prop = (police["subject_injury"].isna().mean())
missing_prop

np.float64(0.7619342359767892)

In [26]:
#Q1.2
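# Assumed cleaning step (not shown in the cells above): label NaN as "Missing".
police["subject_injury_clean"] = police["subject_injury"].fillna("Missing")
pd.crosstab(police["force_type"], police["subject_injury_clean"])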

force_type / subject_injury_clean,Missing,No,Yes
Baton,2,0,2
Bodily Force,7051,1093,1286
Chemical Irritant,1421,131,41
Firearm,0,2,0
Gun Point Display,27,33,44
Improvised Weapon,74,34,40
Less Lethal,87,0,0
Less Lethal Projectile,0,1,2
Maximal Restraint Technique,170,0,0
Police K9 Bite,31,2,44


In [27]:
#Q1.2
police["subject_injury"].isna().sum() / len(police)

np.float64(0.7619342359767892)
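Raw counts make it hard to compare force types of very different sizes, so it can help to look at the share of missing values within each force type. The following is a minimal sketch on an invented toy frame (the rows are hypothetical, not from the Minnesota file), assuming missing values are labeled `"Missing"` as in the cross-tabulation above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the police data; rows are invented for illustration.
toy = pd.DataFrame({
    "force_type": ["Bodily Force", "Bodily Force", "Bodily Force", "Taser", "Taser", "Baton"],
    "subject_injury": [np.nan, "Yes", np.nan, np.nan, "No", "Yes"],
})

# Label missing values explicitly so they appear as their own column.
toy["subject_injury_clean"] = toy["subject_injury"].fillna("Missing")

# normalize="index" turns each row of counts into row proportions.
row_props = pd.crosstab(toy["force_type"], toy["subject_injury_clean"], normalize="index")
print(row_props.round(2))
```

From the counts in the cross-tabulation above, for instance, about 75% of Bodily Force rows (7051 of 9430) and about 89% of Chemical Irritant rows (1421 of 1593) are missing.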

**Q2.** Go to https://sharkattackfile.net/ and download their dataset on shark attacks.

1. Open the shark attack file using Pandas. It is probably not a csv file, so `read_csv` won't work.
2. Drop any columns that do not contain data.
3. Clean the year variable. Describe the range of values you see. Filter the rows to focus on attacks since 1940. Are attacks increasing, decreasing, or remaining constant over time?
4. Clean the Age variable and make a histogram of the ages of the victims.
5. What proportion of victims are male?
6. Clean the `Type` variable so it only takes three values: Provoked and Unprovoked and Unknown. What proportion of attacks are unprovoked?
7. Clean the `Fatal Y/N` variable so it only takes three values: Y, N, and Unknown.
8. Are sharks more likely to launch unprovoked attacks on men or women? Is the attack more or less likely to be fatal when the attack is provoked or unprovoked? Is it more or less likely to be fatal when the victim is male or female? How do you feel about sharks?
9. What proportion of attacks appear to be by white sharks? (Hint: `str.split()` makes a vector of text values into a list of lists, split by spaces.)
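A few of the steps above (dropping empty columns, cleaning the year, and collapsing `Type` to three values) can be sketched on a toy frame. Everything here is an assumption for illustration: the rows are invented, the empty column name `Unnamed: 22` is hypothetical, and the real spreadsheet would presumably be loaded with `pd.read_excel` rather than built by hand.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the shark attack spreadsheet; rows and the empty column are invented.
sharks = pd.DataFrame({
    "Year": [2019.0, 1955.0, 2001.0, 0.0, np.nan, 1988.0],
    "Type": ["Unprovoked", "Provoked", "Boat", "Invalid", "Unprovoked", " unprovoked "],
    "Unnamed: 22": [np.nan] * 6,
})

# Step 2: drop columns that contain no data at all.
sharks = sharks.dropna(axis=1, how="all")

# Step 3: coerce Year to numeric and keep attacks since 1940.
sharks["Year"] = pd.to_numeric(sharks["Year"], errors="coerce")
recent = sharks[sharks["Year"] >= 1940].copy()

# Step 6: collapse Type to Provoked / Unprovoked / Unknown.
type_clean = recent["Type"].str.strip().str.title()
recent["Type"] = type_clean.where(type_clean.isin(["Provoked", "Unprovoked"]), "Unknown")

print(recent["Type"].value_counts(normalize=True))
```

Rows with a missing or implausible year (such as 0) drop out of the filter automatically, because comparisons with NaN evaluate to False.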

**Q3.** Open the "tidy_data.pdf" document in the repo, which is a paper called Tidy Data by Hadley Wickham.

  1. Read the abstract. What is this paper about?

  The paper is about how to organize data in a simple, consistent way called tidy data so it is easier to analyze and visualize. It explains that many problems in data analysis come from messy data layouts and argues that each variable should be a column, each observation a row, and each type of data its own table, which makes working with data much easier and more reliable.

  2. Read the introduction. What is the "tidy data standard" intended to accomplish?

  The tidy data standard is meant to make data easier to clean, explore, and analyze by giving it a consistent structure. By organizing data the same way every time, analysts don’t have to keep reshaping datasets between steps, and tools can work together more smoothly, which lets people focus on the analysis instead of data formatting.

  3. Read the intro to section 2. What does this sentence mean: "Like families, tidy datasets are all alike but every messy dataset is messy in its own way." What does this sentence mean: "For a given dataset, it’s usually easy to figure out what are observations and what are variables, but it is surprisingly difficult to precisely define variables and observations in general."

  The first sentence means that tidy datasets tend to look similar and follow the same basic structure, but messy datasets can be messy in many different and unpredictable ways, so there’s no single fix that works for all of them. The second sentence means that within one specific dataset, it’s usually clear what counts as a variable and what counts as an observation, but when you try to make general rules that work for all datasets and all situations, those definitions can get blurry and depend on context and how the data will be used.

  4. Read Section 2.2. How does Wickham define values, variables, and observations?

  Wickham explains that values are the actual data entries (numbers or text), variables are groups of values that measure the same thing (like height or temperature), and observations are all the values collected for one unit, such as one person, one day, or one event. In other words, variables describe what is being measured, and observations describe who or what is being measured.

  5. How is "Tidy Data" defined in section 2.3?

  In 2.3, tidy data is defined by three simple rules: each variable is a column, each observation is a row, and each type of observational unit is stored in its own table. This structure makes datasets easier to understand and easier to use with analysis tools because everything is organized the same way.

  6. Read the intro to Section 3 and Section 3.1. What are the 5 most common problems with messy datasets? Why are the data in Table 4 messy? What is "melting" a dataset?

   Messy datasets usually have one or more of these problems: column headers are values instead of variable names, multiple variables are stored in one column, variables appear in both rows and columns, different types of observations are mixed in the same table, or the same type of observation is split across multiple tables. Table 4 is messy because the income ranges are used as column headers instead of being stored as values in a single income variable, which spreads one variable across many columns and makes analysis harder. “Melting” a dataset means reshaping it so that those columns are stacked into rows, creating one column for the variable (like income) and one for the values, which helps turn the data into tidy form.

  7. Why, specifically, is table 11 messy but table 12 is tidy and "molten"?

  Table 11 is messy because variables are split across both rows and columns: days are spread across many columns (d1–d31) and measurement type (tmin, tmax) is stored in rows, so neither is in a single column. Table 12(a) is “molten” because the day columns are stacked into rows, creating a value column and a column for measurement type. Table 12(b) is fully tidy because each variable has its own column (like date, tmin, and tmax) and each row represents one day’s observation.

  8. Read Section 6. What is the "chicken-and-egg" problem with focusing on tidy data? What does Wickham hope happens in the future with further work on the subject of data wrangling?
  
  Wickham says the “chicken-and-egg” problem is that tidy data is only useful if there are good tools that work with it, but tools are often designed around existing data formats, so it’s hard to improve one without improving the other at the same time. He hopes that future work will build on the tidy data idea to create better data structures and better tools together, and that researchers will study how people actually work with data to design systems that reduce friction in data cleaning and analysis.
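
The melting and tidying described in the answers to items 6 and 7 can be sketched in pandas. This is only an illustration under stated assumptions: the toy table mimics the shape of the weather example (one row per measurement type, one column per day), the numbers and the station id `S1` are invented, and `melt` followed by `pivot` is one possible pandas equivalent of the paper's melt and cast steps.

```python
import pandas as pd

# Toy table shaped like the weather example: one row per measurement type,
# one column per day. The numbers are invented.
messy = pd.DataFrame({
    "id": ["S1", "S1"],
    "element": ["tmax", "tmin"],
    "d1": [27.8, 14.5],
    "d2": [27.3, 14.4],
})

# Melt: stack the day columns into rows (the "molten" form).
molten = messy.melt(id_vars=["id", "element"], var_name="day", value_name="value")

# Cast: spread the element values back out so tmin and tmax are columns.
tidy = (
    molten.pivot(index=["id", "day"], columns="element", values="value")
    .reset_index()
)
tidy.columns.name = None
print(tidy)
```

In the molten form each row holds a single measured value. After the pivot each row is one day's observation, with `tmax` and `tmin` as separate variables.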