In [1]:
! git clone https://github.com/DS3001/wrangling

Cloning into 'wrangling'...
remote: Enumerating objects: 92, done.
remote: Counting objects: 100% (49/49), done.
remote: Compressing objects: 100% (35/35), done.
remote: Total 92 (delta 36), reused 17 (delta 14), pack-reused 43 (from 1)
Receiving objects: 100% (92/92), 18.19 MiB | 6.37 MiB/s, done.
Resolving deltas: 100% (41/41), done.


**Q1**
1. This paper is about methods for data tidying: organizing data in a consistent, easy-to-manipulate structure.
2. It's supposed to provide a standard way to organise data values within a dataset. This makes initial data cleaning easier because you don't need to start from scratch each time. It facilitates initial exploration and analysis of the data and simplifies the development of data analysis tools.
3. Tidy datasets are all clean in the same way, but messy datasets are each flawed in their own special ways.
4. Values are usually numbers or strings and are used to create a dataset. Variables contain all values that measure the same underlying attribute across units. Observations contain all values measured on the same unit across attributes.
5. Tidy data is a standard way of mapping the meaning of a dataset to its structure in which each variable forms a column, each observation forms a row, and each type of observational unit forms a table.
6. The five most common problems are: column headers are values, not variable names; multiple variables are stored in one column; variables are stored in both rows and columns; multiple types of observational units are stored in the same table; and a single observational unit is stored in multiple tables. In Table 4, variables form both the rows and the columns, and the column headers are values, not variable names. Melting a dataset means turning columns into rows.
7. Table 11 is messy because it has variables in individual columns, spread across columns, and spread across rows, among other issues. Table 12 is tidy because it has been melted and standardized, and missing values have been dropped.
8. If tidy data is only as useful as the tools that work with it, then tidy tools will be inextricably linked to tidy data. Wickham hopes that others will build on this framework to develop even better data storage strategies and better tools.
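
The melting step mentioned in item 6 can be sketched in pandas. This is only an illustration under stated assumptions: the table `wide`, its column names, and its numbers are made up for the example and do not come from the paper.

```python
import pandas as pd

# Made-up wide table: the headers "2019" and "2020" are values of a
# "year" variable, not variable names.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2019": [10, 30],
    "2020": [20, 40],
})

# Melt: keep "country" as the identifier and turn the year columns into rows,
# so each row is one (country, year) observation.
tidy = wide.melt(id_vars="country", var_name="year", value_name="cases")
tidy["year"] = tidy["year"].astype(int)
print(tidy)
```

After melting, each variable (country, year, cases) has its own column and each observation has its own row.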

In [32]:
import numpy as np
import pandas as pd
df = pd.read_csv('/content/wrangling/assignment/data/airbnb_hw.csv')

In [14]:
df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30478 entries, 0 to 30477
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Host Id                     30478 non-null  int64  
 1   Host Since                  30475 non-null  object 
 2   Name                        30478 non-null  object 
 3   Neighbourhood               30478 non-null  object 
 4   Property Type               30475 non-null  object 
 5   Review Scores Rating (bin)  22155 non-null  float64
 6   Room Type                   30478 non-null  object 
 7   Zipcode                     30344 non-null  float64
 8   Beds                        30393 non-null  float64
 9   Number of Records           30478 non-null  int64  
 10  Number Of Reviews           30478 non-null  int64  
 11  Price                       30478 non-null  object 
 12  Review Scores Rating        22155 non-null  float64
dtypes: float64(4), int64(3), object

In [15]:
price = df['Price']
price.unique()

array(['145', '37', '28', '199', '549', '149', '250', '90', '270', '290',
       '170', '59', '49', '68', '285', '75', '100', '150', '700', '125',
       '175', '40', '89', '95', '99', '499', '120', '79', '110', '180',
       '143', '230', '350', '135', '85', '60', '70', '55', '44', '200',
       '165', '115', '74', '84', '129', '50', '185', '80', '190', '140',
       '45', '65', '225', '600', '109', '1,990', '73', '240', '72', '105',
       '155', '160', '42', '132', '117', '295', '280', '159', '107', '69',
       '239', '220', '399', '130', '375', '585', '275', '139', '260',
       '35', '133', '300', '289', '179', '98', '195', '29', '27', '39',
       '249', '192', '142', '169', '1,000', '131', '138', '113', '122',
       '329', '101', '475', '238', '272', '308', '126', '235', '315',
       '248', '128', '56', '207', '450', '215', '210', '385', '445',
       '136', '247', '118', '77', '76', '92', '198', '205', '299', '222',
       '245', '104', '153', '349', '114', '320', '292', '22

Prices of 1,000 or more contain a comma as a thousands separator, so pandas reads the whole column as strings.

In [16]:
price = price.str.replace(',','')
print( price.unique() , '\n')
price = pd.to_numeric(price,errors='coerce')
print( price.unique() , '\n')
print( 'Total missing: ', sum( price.isnull() ) )

['145' '37' '28' '199' '549' '149' '250' '90' '270' '290' '170' '59' '49'
 '68' '285' '75' '100' '150' '700' '125' '175' '40' '89' '95' '99' '499'
 '120' '79' '110' '180' '143' '230' '350' '135' '85' '60' '70' '55' '44'
 '200' '165' '115' '74' '84' '129' '50' '185' '80' '190' '140' '45' '65'
 '225' '600' '109' '1990' '73' '240' '72' '105' '155' '160' '42' '132'
 '117' '295' '280' '159' '107' '69' '239' '220' '399' '130' '375' '585'
 '275' '139' '260' '35' '133' '300' '289' '179' '98' '195' '29' '27' '39'
 '249' '192' '142' '169' '1000' '131' '138' '113' '122' '329' '101' '475'
 '238' '272' '308' '126' '235' '315' '248' '128' '56' '207' '450' '215'
 '210' '385' '445' '136' '247' '118' '77' '76' '92' '198' '205' '299'
 '222' '245' '104' '153' '349' '114' '320' '292' '226' '420' '500' '325'
 '307' '78' '265' '108' '123' '189' '32' '58' '86' '219' '800' '335' '63'
 '229' '425' '67' '87' '1200' '158' '650' '234' '310' '695' '400' '166'
 '119' '62' '168' '340' '479' '43' '395' '144' '52' '47
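
An alternative to stripping the commas afterwards is to tell `pd.read_csv` about the thousands separator when loading. The sketch below uses a small in-memory CSV with a few prices copied from the output above rather than the real file, so it is an illustration of the `thousands` parameter, not a rerun of the cell.

```python
import io
import pandas as pd

# Tiny stand-in for the Price column; values with commas must be quoted in CSV.
csv_text = 'Price\n"145"\n"1,990"\n"1,000"\n'

# thousands="," makes the parser treat the comma as a digit-group separator.
prices = pd.read_csv(io.StringIO(csv_text), thousands=",")
print(prices["Price"].tolist())
print(prices["Price"].dtype)
```

Either approach should give a numeric column; cleaning after loading, as in the cell above, has the advantage of letting you inspect the raw strings first.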

In [20]:
shark = pd.read_csv('/content/wrangling/data/sharks.csv', low_memory=False)
shark.head()

Unnamed: 0,index,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,...,Unnamed: 246,Unnamed: 247,Unnamed: 248,Unnamed: 249,Unnamed: 250,Unnamed: 251,Unnamed: 252,Unnamed: 253,Unnamed: 254,Unnamed: 255
0,0,2020.02.05,05-Feb-2020,2020.0,Unprovoked,USA,Maui,,Stand-Up Paddle boarding,,...,,,,,,,,,,
1,1,2020.01.30.R,Reported 30-Jan-2020,2020.0,Provoked,BAHAMAS,Exumas,,Floating,Ana Bruna Avila,...,,,,,,,,,,
2,2,2020.01.17,17-Jan-2020,2020.0,Unprovoked,AUSTRALIA,New South Wales,Windang Beach,Surfing,Will Schroeter,...,,,,,,,,,,
3,3,2020.01.16,16-Jan-2020,2020.0,Unprovoked,NEW ZEALAND,Southland,Oreti Beach,Surfing,Jordan King,...,,,,,,,,,,
4,4,2020.01.13,13-Jan-2020,2020.0,Unprovoked,USA,North Carolina,"Rodanthe, Dare County",Surfing,Samuel Horne,...,,,,,,,,,,


In [21]:
shark['Type'].value_counts()

Type,count
Unprovoked,4716
Provoked,593
Invalid,552
Sea Disaster,239
Watercraft,142
Boat,109
Boating,92
Questionable,10
Unconfirmed,1
Unverified,1


In [34]:
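# Use a name other than `type`, which would shadow the Python built-in
shark_type = shark['Type']

# Collapse the boat-related labels (including the misspelling 'Boatomg') into 'Watercraft'
shark_type = shark_type.replace(['Sea Disaster', 'Boat','Boating','Boatomg'],'Watercraft')

# Treat unverifiable or unresolved labels as missing
shark_type = shark_type.replace(['Invalid', 'Questionable','Unconfirmed','Unverified','Under investigation'],np.nan)

shark['Type'] = shark_type
del shark_type

shark['Type'].value_counts()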

Type,count
Unprovoked,4716
Provoked,593
Watercraft,583
,565


This collapses the many type labels into a few broader categories and treats the unverifiable ones as missing.

In [30]:
trial = pd.read_csv('/content/wrangling/pretrial_data.csv', low_memory=False)
trial.head()
trial.columns.tolist()

['Unnamed: 0',
 'InternalStudyID',
 'REQ_REC#',
 'Defendant_Sex',
 'Defendant_Race',
 'Defendant_BirthYear',
 'Defendant_Age',
 'Defendant_AgeGroup',
 'Defendant_AgeatCurrentArrest',
 'Defendant_AttorneyTypeAtCaseClosure',
 'is_poor',
 'Defendant_RecordedZipCode_eMag',
 'Defendant_VirginiaResidencyStatus',
 'released',
 'PretrialReleaseDate',
 'DaysBetweenContactEventAndPretrialRelease',
 'PretrialReleaseType1',
 'PretrialReleaseType2',
 'BondTypeAtInitialContact',
 'bond',
 'BondTypeAtRelease_v1',
 'BondTypeatRelease_v2',
 'BondAmountAtRelease',
 'WhetherDefendantReceivedPretrialServicesAgencySuperv_PTCC',
 'DaysBetweenReleaseandActivePretrialServicesAgencySupervDate',
 'DaysBetweenPretrialServicesAgencySupervReferralDateandSupervDate',
 'Indicator_PresumptiveDenialOfBail_19.2_120',
 'Indicator_ConditionsToBeReleasedSecuredBond_19.2_123',
 'IfReleasedonSecuredBond_TypeofSurety',
 'Indicator_BailTermSetByCourt_eMag',
 'AdditionalJailTimeServedAfterInitialPretrialRelease',
 'Oct2017_Con

In [29]:
release = trial['WhetherDefendantWasReleasedPretrial']
print(release.unique(),'\n')
print(release.value_counts(),'\n')
release = release.replace(9,np.nan)
print(release.value_counts(),'\n')
sum(release.isnull())
trial['WhetherDefendantWasReleasedPretrial'] = release
del release

KeyError: 'WhetherDefendantWasReleasedPretrial'

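Error: this file has no column named "WhetherDefendantWasReleasedPretrial". The column list above does include a `released` column, which may be the intended variable.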
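
One way to avoid this kind of `KeyError` is to check for the column before using it and, if it is absent, list similarly named columns. The sketch below runs on a toy frame: only the column name `released` comes from the listing above, and the values are made up.

```python
import pandas as pd

# Toy stand-in for the pretrial data; values are invented for illustration.
toy = pd.DataFrame({"released": [1, 0, 9, 1]})

wanted = "WhetherDefendantWasReleasedPretrial"
if wanted in toy.columns:
    col = toy[wanted]
else:
    # Look for columns whose names mention "release" as likely candidates.
    candidates = [c for c in toy.columns if "release" in c.lower()]
    print(f"{wanted!r} not found; candidates: {candidates}")
```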

In [35]:
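length = trial['ImposedSentenceAllChargeInContactEvent']
# Use a name other than `type`, which would shadow the Python built-in
sentence_type = trial['SentenceTypeAllChargesAtConvictionInContactEvent']

# Coerce to numeric; anything that cannot be parsed becomes NaN
length = pd.to_numeric(length,errors='coerce')
length_NA = length.isnull()
print( np.sum(length_NA),'\n')

# See which sentence types account for the missing lengths
print( pd.crosstab(length_NA, sentence_type), '\n')

# Recode by sentence type: type 4 gets a length of 0, type 9 stays missing
length = length.mask( sentence_type == 4, 0)
length = length.mask( sentence_type == 9, np.nan)

length_NA = length.isnull()
print( pd.crosstab(length_NA, sentence_type), '\n')
print( np.sum(length_NA),'\n')

# Write back to the pretrial frame (`trial`), not the Airbnb frame (`df`)
trial['ImposedSentenceAllChargeInContactEvent'] = length
del length, sentence_type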

9053 

SentenceTypeAllChargesAtConvictionInContactEvent     0     1    2     4    9
ImposedSentenceAllChargeInContactEvent                                      
False                                             8720  4299  914     0    0
True                                                 0     0    0  8779  274 

SentenceTypeAllChargesAtConvictionInContactEvent     0     1    2     4    9
ImposedSentenceAllChargeInContactEvent                                      
False                                             8720  4299  914  8779    0
True                                                 0     0    0     0  274 

274 
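
The recoding with `Series.mask` can be seen more easily on a toy example. The values below are made up; only the codes 4 and 9 are taken from the cross-tabulation above, and the sketch makes no claim about what those codes mean in the codebook.

```python
import numpy as np
import pandas as pd

# Invented lengths and sentence-type codes, aligned row by row.
toy_length = pd.Series([12.0, np.nan, np.nan, 6.0])
toy_type = pd.Series([0, 4, 9, 1])

# mask(cond, value) replaces entries where cond is True and keeps the rest.
toy_length = toy_length.mask(toy_type == 4, 0)       # type 4 -> length 0
toy_length = toy_length.mask(toy_type == 9, np.nan)  # type 9 -> stays missing
print(toy_length)
```

Only the row with code 9 remains missing, which mirrors how the missing count in the output above drops from 9053 to 274.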



**Q3**
1. The Census first asked whether the respondent was of Hispanic, Latino, or Spanish origin. It then asked the respondent to check which races they were from a list, allowing for multiple to be checked. There was no race option for those of Hispanic, Latino, or Spanish origin in the second part of the question, leaving many to select "some other race." There was also no option for those from a Middle Eastern background, leaving them to select "some other race" as well.
2. Census data aids with redistricting and apportionment for the House of Representatives. It also helps with federal funding allocation to at-risk communities, and it helps the government and its partners understand social dynamics and socioeconomic demographic data over time. Census data helps prevent the practice of gerrymandering and ensures all votes are given equal power. Quality data matters because without it, ethnic groups could be misrepresented in the political sphere. Federal and state funding could come up short if inaccurate data is collected.
3. The Census did a good job of reaching hard-to-reach communities in the US such as Native Americans, immigrants, and those from low-income communities. It offered multiple ways to respond, including online, which made the Census much more accessible. The Census did not ask about gender identity, which plays a major role in many Americans' identities. Many likely felt underrepresented without the inclusion of this question. The separation of Hispanic and Latino origin from race likely caused confusion among many. Additionally, the lack of a Middle Eastern race category likely left many feeling underrepresented as well. In the future, race and ethnicity questions should be combined to avoid confusion. Questions on gender identity and sexual orientation should be included to gain a broader understanding of the population. There should be further guidance and education on the aims of the Census and how the data is and is not used. The Census's practice of offering multilingual options should be adopted in other large-scale surveys to include a more diverse sample of people. Partnering with local organizations also helps increase participation from underrepresented communities.
4. The Census only asked respondents if they were male or female and did not include other options. It also did not ask about gender identity, which would have included non-binary and transgender respondents. It is good practice to keep the response options simple and easy to follow, and these options were consistent with past data. Additionally, some may be concerned that their gender identity would be leaked or used against them somehow, as it is a sensitive subject. However, not collecting data on transgender and non-binary people decreases the visibility of gender diversity and makes it more challenging for the US to track this growing demographic and create corresponding policies. Conflating gender with sex also supports outdated ways of thinking. The Census should offer both sex and gender identity questions and include a question on transgender status.
5. Researcher bias is a concern. Some may conflate gender identity with sex, which will skew the data, especially when compared to data that does not. These topics are sensitive and may leave respondents worried about data leaks or re-identification. Some groups may be more likely to leave answers blank out of fear or concern for what is being done with that data, so omitting missing values entirely can heavily skew the data. Researchers must not make assumptions about race, gender identity, or sexual identity based on other variables collected. Good practices for handling this data include transparency, privacy, and avoiding assumptions. Researchers must be transparent about why they are collecting this data and what they intend to do with it. This fosters trust between the respondent and the researcher and makes respondents more likely to answer truthfully. Data must be fully de-identified and stored safely to avoid security and privacy risks. Lastly, researchers must not make assumptions when dealing with missing data.
6. Imputing these values would pose major risks of perpetuating harmful and inaccurate stereotypes. This would lead to further biased and discriminatory data that can have harmful effects on communities. It also leads to risks of re-identification. Imputing values may add enough extra data to be able to decipher who each individual is and put their privacy at risk. The data is also unlikely to be wholly accurate. Protected characteristics are not straightforward, and making assumptions about them leads to inaccurate data.
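
The point in item 5 about some groups being more likely to leave answers blank can be checked before deciding to drop missing values. The sketch below is a minimal illustration on made-up data; the column names `group` and `answer` are hypothetical and do not refer to any Census variable.

```python
import numpy as np
import pandas as pd

# Invented survey responses; NaN marks a skipped question.
survey = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "answer": [1.0, 2.0, 3.0, np.nan, np.nan, 4.0],
})

# Share of missing answers within each group.
missing_rate = survey["answer"].isna().groupby(survey["group"]).mean()
print(missing_rate)
```

If the rates differ a lot between groups, dropping rows with missing answers would under-represent the group that skipped more often.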