## Fixing Incorrect Values in the State Column
In this exercise, you will clean the State variable in a modified version of a dataset that lists all the finance officers in the USA. The dataset contains some incorrect values that need to be fixed. Let's get started.

In [1]:
import pandas as pd

In [2]:
file_url = ('https://raw.githubusercontent.com/PacktWorkshops/'\
            'The-Data-Science-Workshop/master/Chapter11/dataset/officers.csv')

In [3]:
df = pd.read_csv(file_url)

In [4]:
df.head()

Unnamed: 0,ID,City,State,Zip,Title,RedactionRequested
0,804,Glenview,IL,60025,Treasurer,False
1,9177,Harrisburg,IL,62946,Treasurer,False
2,53011,Chicago,IL,60606,Treasurer,False
3,9176,Harrisburg,IL,62946,Chairman,False
4,33020,Mechanicsburg,IL,62545,Chairman,False


In [5]:
# unique values of State variable
df['State'].unique()

array(['IL', 'PA', 'DC', 'Il', nan, 'WI', 'CA', 'MO', 'NC', 'IA', 'MA',
       'IN', 'MI', 'TN', 'NY', 'ng', 'TX', 'CO', 'NV', 'il', 'WA', '8I',
       'In', 'iL', 'OH', 'SC', 'VA', 'NM', 'FL', 'LA', 'GA', 'II', 'NJ',
       'MD', 'I', 'AR', 'KS', 'DE', '60', 'SD', 'MN', 'VT', 'OK', 'KY',
       'CT', 'NH', 'AZ', 'OR', 'PR', 'RI'], dtype=object)

All the states are encoded as two-character, capitalized codes. As you can see, there are some incorrect values with non-capitalized characters, such as il and iL (they look like misspellings of IL, the code for Illinois), and unexpected values such as 8I, I, and 60. In the next few steps, you are going to fix these issues.
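
Before fixing anything, it can help to see how often each value occurs, since rare values are often the suspicious ones. The following is a minimal sketch on a small hypothetical Series (`states` is an illustrative stand-in, not the officers dataset). It uses `value_counts(dropna=False)` so that missing values are counted as well:

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for df['State']
states = pd.Series(['IL', 'IL', 'IL', 'il', 'Il', '8I', np.nan], dtype=object)

# Count every distinct value, including missing ones
counts = states.value_counts(dropna=False)
print(counts)

# Values that appear only once deserve a closer look
rare = counts[counts == 1].index.tolist()
print(rare)
```

On the real data, the same call would be `df['State'].value_counts(dropna=False)`.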

Print out the rows whose State value contains il, using the pandas .str.contains() method and the subsetting API, that is, DataFrame[condition]. You also have to set the na parameter to False in str.contains() so that observations with missing values are excluded:

In [6]:
df['State'].str.contains('il', na=False)

0        False
1        False
2        False
3        False
4        False
         ...  
55209    False
55210    False
55211    False
55212    False
55213    False
Name: State, Length: 55214, dtype: bool

In [7]:
df[df['State'].str.contains('il', na=False)]

Unnamed: 0,ID,City,State,Zip,Title,RedactionRequested
4245,47448,Chicago,il,60619,Treasurer,False
4651,47447,Chicago,il,60623-1614,Chairman,False
4652,54025,Chicago,il,60623-1614,Chairman,False
18939,39418,Kingston,il,60145,Chairman,False
29699,27124,Hampshire,il,60140,Chairman,False
43761,29179,McHenry,il,60050,Admin Asst,False
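
Note that `str.contains()` matches substrings (and treats the pattern as a regular expression by default), so it can return more rows than an exact comparison. A small sketch, using a hypothetical Series rather than the real data, illustrates the difference and the effect of `na=False`:

```python
import numpy as np
import pandas as pd

# Hypothetical values; 'mil' is made up to show substring matching
states = pd.Series(['IL', 'il', np.nan, 'Il', 'mil'], dtype=object)

# Substring match; na=False turns missing values into False
contains_mask = states.str.contains('il', na=False)

# Exact match; a comparison with NaN is already False
exact_mask = states == 'il'

print(states[contains_mask].tolist())  # ['il', 'mil']
print(states[exact_mask].tolist())     # ['il']
```

In this dataset both approaches return the same six rows, but the exact comparison is the safer choice when only one specific value is wanted.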


Now, create a for loop that iterates through the following values in the State column: il, Il, and iL. For each one, print out the values of the City and State variables using the pandas .loc indexer for subsetting: DataFrame.loc[row_condition, column_condition]

In [8]:
for state in ['il', 'Il', 'iL']:
    print(df.loc[df['State'] == state, ['City', 'State']])

            City State
4245     Chicago    il
4651     Chicago    il
4652     Chicago    il
18939   Kingston    il
29699  Hampshire    il
43761    McHenry    il
            City State
43        Ottawa    Il
44        Ottawa    Il
493    Galesburg    Il
613      Chicago    Il
614      Chicago    Il
...          ...   ...
54915    Chicago    Il
54916    Chicago    Il
54918    Chicago    Il
54919    Chicago    Il
54921    Chicago    Il

[665 rows x 2 columns]
         City State
7052  Wheaton    iL


Create a condition mask (**il_mask**) to subset all the rows that contain the three incorrect values (**il, Il, and iL**) by using the isin() method with a list of these values as a parameter:

In [9]:
il_mask = df['State'].isin(['il', 'Il', 'iL'])

Print the number of rows that match the condition we set in il_mask using the .sum() method. Because True counts as 1, this counts all the rows that match the condition:

In [10]:
il_mask.sum()

672
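
Summing a Boolean mask works because `True` counts as 1 and `False` as 0. A minimal sketch with a hypothetical Series (the names `states` and `variants` are illustrative only):

```python
import pandas as pd

# Hypothetical sample standing in for df['State']
states = pd.Series(['IL', 'il', 'Il', 'PA', 'iL', 'Il'], dtype=object)

# True wherever the value is one of the listed variants
variants = ['il', 'Il', 'iL']
mask = states.isin(variants)

# Number of matching rows
n_matches = int(mask.sum())
print(n_matches)  # 4

# Fraction of matching rows
share = mask.mean()
print(round(share, 3))  # 0.667
```

The same idea explains the count above: 6 rows with il, 665 with Il, and 1 with iL add up to 672.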

Using the pandas .loc indexer, subset the rows with the il_mask condition mask and replace the value of the State column with IL:

In [11]:
df.loc[il_mask, 'State'] = 'IL'
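
Since il, Il, and iL differ from IL only by letter case, one possible alternative is to uppercase the whole column with `.str.upper()`. This is a sketch on hypothetical data, not the method used in this exercise. It would also change other mixed-case values (for example, In becomes IN) and would not repair values such as 8I:

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for df['State']
states = pd.Series(['IL', 'il', 'Il', 'iL', 'In', '8I', np.nan], dtype=object)

# Uppercase every string; missing values stay missing
upper = states.str.upper()

print(upper.tolist())
```

The mask-based assignment used in this exercise is more explicit, because it changes only the rows you have inspected.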

In [12]:
df['State'].unique()

array(['IL', 'PA', 'DC', nan, 'WI', 'CA', 'MO', 'NC', 'IA', 'MA', 'IN',
       'MI', 'TN', 'NY', 'ng', 'TX', 'CO', 'NV', 'WA', '8I', 'In', 'OH',
       'SC', 'VA', 'NM', 'FL', 'LA', 'GA', 'II', 'NJ', 'MD', 'I', 'AR',
       'KS', 'DE', '60', 'SD', 'MN', 'VT', 'OK', 'KY', 'CT', 'NH', 'AZ',
       'OR', 'PR', 'RI'], dtype=object)

As you can see, the three incorrect values are not present anymore. Let's have a look at the remaining incorrect values: II, I, 8I, and 60. We will deal with II in the next step.

Print out the rows that have a value of **II** in the State column using the pandas subsetting API, that is, DataFrame.loc[row_condition, column_condition]:

In [13]:
df.loc[df['State'] == 'II']

Unnamed: 0,ID,City,State,Zip,Title,RedactionRequested
14340,28039,Bloomington,II,61704,Co-Chairman,False
14341,31994,Bloomington,II,61704,Chairman,False
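
The City and Zip columns give a hint about the intended state. As a rough sanity check, and assuming that Illinois ZIP codes begin with 60, 61, or 62 (an assumption that is not stated in the dataset), the sketch below rebuilds the two rows shown above in a hypothetical DataFrame called `suspect` and tests the Zip prefix:

```python
import pandas as pd

# The two rows printed above, rebuilt by hand for illustration
suspect = pd.DataFrame({
    'City': ['Bloomington', 'Bloomington'],
    'State': ['II', 'II'],
    'Zip': ['61704', '61704'],
})

# Assumed Illinois ZIP prefixes (not taken from the dataset)
il_prefixes = ['60', '61', '62']

# Compare the first two characters of each Zip with the assumed prefixes
looks_like_il = suspect['Zip'].str[:2].isin(il_prefixes)
print(looks_like_il.all())  # True
```

A check like this supports, but does not prove, that II was meant to be IL.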


Now, create a for loop that iterates through the three incorrect values (I, 8I, and 60) and prints out the subsetted rows using the same .loc logic as in the previous step. Only display the City and State columns:

In [19]:
for state in ['I', '8I', '60']:
    print('\n{}'.format(df.loc[df['State'] == state, ['City', 'State']]))


              City State
17596  Bloomington     I

             City State
5513  Springfield    8I

          City State
28060  Chicago    60


Create a for loop that iterates through the four incorrect values (II, I, 8I, and 60) and reuses the same .loc subsetting logic to replace the value in State with IL:

In [20]:
for val in ['I', '8I', '60', 'II']:
    df.loc[df['State'] == val, 'State'] = 'IL'
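
The same replacement can also be expressed without a loop. One option, sketched here on a hypothetical Series, is `Series.replace()` with a dictionary that maps each incorrect value to its correction. The `corrections` name is illustrative:

```python
import pandas as pd

# Hypothetical sample standing in for df['State']
states = pd.Series(['IL', 'I', '8I', 'PA', '60', 'II'], dtype=object)

# Map each incorrect value to the intended code
corrections = {'I': 'IL', '8I': 'IL', '60': 'IL', 'II': 'IL'}

# replace() matches whole values, so 'IL' and 'PA' are left untouched
fixed = states.replace(corrections)
print(fixed.tolist())  # ['IL', 'IL', 'IL', 'PA', 'IL', 'IL']
```

A dictionary keeps all corrections in one place, which is convenient when different incorrect values map to different states.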

In [21]:
df['State'].unique()

array(['IL', 'PA', 'DC', nan, 'WI', 'CA', 'MO', 'NC', 'IA', 'MA', 'IN',
       'MI', 'TN', 'NY', 'ng', 'TX', 'CO', 'NV', 'WA', 'In', 'OH', 'SC',
       'VA', 'NM', 'FL', 'LA', 'GA', 'NJ', 'MD', 'AR', 'KS', 'DE', 'SD',
       'MN', 'VT', 'OK', 'KY', 'CT', 'NH', 'AZ', 'OR', 'PR', 'RI'],
      dtype=object)

In [23]:
# inspect the remaining incorrect values
for state in ['In', 'ng']:
    print('\n{}'.format(df.loc[df['State'] == state, ['City', 'State']]))


           City State
5733  Sherville    In

            City State
2428  none given    ng
2961  none given    ng


In [24]:
# correct the State value of the Sherville row to IN
df.loc[df['State'] == 'In', 'State'] = 'IN'

In [25]:
df['State'].unique()

array(['IL', 'PA', 'DC', nan, 'WI', 'CA', 'MO', 'NC', 'IA', 'MA', 'IN',
       'MI', 'TN', 'NY', 'ng', 'TX', 'CO', 'NV', 'WA', 'OH', 'SC', 'VA',
       'NM', 'FL', 'LA', 'GA', 'NJ', 'MD', 'AR', 'KS', 'DE', 'SD', 'MN',
       'VT', 'OK', 'KY', 'CT', 'NH', 'AZ', 'OR', 'PR', 'RI'], dtype=object)
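
The ng value is still present. Its City is none given, so it presumably marks a state that was never provided. One way to handle it, shown as a sketch on hypothetical data rather than as part of the original exercise, is to convert it to a missing value and then verify that every remaining non-missing value consists of two uppercase letters:

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the cleaned df['State']
states = pd.Series(['IL', 'ng', 'IN', np.nan, 'PA'], dtype=object)

# mask() sets the positions where the condition is True to NaN
cleaned = states.mask(states == 'ng')

# True for values made of exactly two uppercase letters; NaN counts as False
is_valid = cleaned.str.fullmatch(r'[A-Z]{2}', na=False)

# Non-missing values that fail the format check
invalid = cleaned[cleaned.notna() & ~is_valid]

print(int(cleaned.isna().sum()))  # 2
print(invalid.tolist())           # []
```

A format check like this would not have caught II, which has the right shape but is not a real code. Comparing against an explicit list of valid codes with `isin()` is stricter.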