In [1]:
import csv
import pandas as pd

In [2]:
census = pd.read_csv('census_data.csv', index_col = 0)

In [3]:
census.head()

Unnamed: 0,first_name,last_name,birth_year,voted,num_children,income_year,higher_tax,marital_status
0,Denise,Ratke,2005,False,0,92129.41,disagree,single
1,Hali,Cummerata,1987,False,0,75649.17,neutral,divorced
2,Salomon,Orn,1992,True,2,166313.45,agree,single
3,Sarina,Schiller,1965,False,2,71704.81,strongly agree,married
4,Gust,Abernathy,1945,False,2,143316.08,agree,married


In [4]:
census.dtypes

first_name         object
last_name          object
birth_year         object
voted                bool
num_children        int64
income_year       float64
higher_tax         object
marital_status     object
dtype: object

The manager of the census would like to know the average birth year of the respondents. We can see from .dtypes that birth_year has been assigned the object data type (the pandas default for strings), whereas it should be expressed as int.

Print the unique values of the variable using the .unique() method.

In [5]:
census['birth_year'].unique()

array(['2005', '1987', '1992', '1965', '1945', '1951', '1963', '1949',
       '1950', '1971', '2007', '1944', '1995', '1973', '1946', '1954',
       '1994', '1989', '1947', '1993', '1976', '1984', 'missing', '1966',
       '1941', '2000', '1953', '1956', '1960', '2001', '1980', '1955',
       '1985', '1996', '1968', '1979', '2006', '1962', '1981', '1959',
       '1977', '1978', '1983', '1957', '1961', '1982', '2002', '1998',
       '1999', '1952', '1940', '1986', '1958'], dtype=object)

There is at least one missing value. Let's check how many cells in the 'birth_year' column have a missing value.
I found a useful resource explaining <a href='https://practicaldatascience.co.uk/data-science/how-to-use-isna-to-check-for-missing-values-in-pandas-dataframes#:~:text=The%20easiest%20way%20to%20check,a%20load%20of%20boolean%20values.'>how to check for missing data</a>.

Calling df.isna() on its own returns a DataFrame of boolean values, which is not usually very useful, so instead we'll count the missing values per column by running df.isna().sum(). This returns each column of the DataFrame along with the number of missing values detected in it, so 0 means there are no missing values and 1 means there is a single missing value.

In [6]:
census.isna().sum()

first_name        0
last_name         0
birth_year        0
voted             0
num_children      0
income_year       0
higher_tax        0
marital_status    0
dtype: int64

According to isna(), the DataFrame has no missing values. Looking more closely at the output of .unique(), 'missing' is an ordinary string rather than NaN, so isna() does not count it.
Let's check how many times 'missing' appears in the 'birth_year' column.

In [7]:
census['birth_year'].value_counts()['missing']

1
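
When the placeholder text is not known in advance, one way to surface it is to try converting the column to numbers and look at what fails. The sketch below uses a small made-up Series (not the census file) and pd.to_numeric with errors='coerce'; the variable names are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the birth_year column (values are illustrative only)
birth_year = pd.Series(['2005', '1987', 'missing', '1965'])

# isna() finds nothing because 'missing' is an ordinary string, not NaN
n_nan = int(birth_year.isna().sum())

# Coerce to numbers: anything that cannot be parsed becomes NaN
as_numbers = pd.to_numeric(birth_year, errors='coerce')

# The original strings at the positions that failed to parse
placeholders = birth_year[as_numbers.isna()].unique().tolist()

print(n_nan)         # 0
print(placeholders)  # ['missing']
```

This lists every non-numeric entry at once, which helps if a column hides more than one kind of placeholder.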

There appears to be a missing value in the birth_year column. With some research you find that the respondent’s birth year is 1967.

Use the .replace() method to replace the missing value with 1967, so that the data type can be changed to int. Then recheck the values in birth_year by calling the .unique() method and printing the results.

In [8]:
census['birth_year'] = census['birth_year'].replace(['missing'], 1967)

Now change the variable's type to int.

In [9]:
census['birth_year'] = census['birth_year'].astype('int')
census['birth_year'].dtypes

dtype('int32')
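
The integer width behind 'int' can depend on the platform and NumPy version, which is why the output here is int32. As a hedged sketch on a made-up Series, asking for 'int64' explicitly should give the same dtype everywhere; replacing the placeholder with the string '1967' also keeps the column's values of one type until the cast.

```python
import pandas as pd

# Toy stand-in for the birth_year column (values are illustrative only)
birth_year = pd.Series(['2005', 'missing', '1965'])

# Swap the placeholder for the researched year, then cast with an explicit width
cleaned = birth_year.replace('missing', '1967').astype('int64')

print(cleaned.dtype)     # int64
print(cleaned.tolist())  # [2005, 1967, 1965]
```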

Having assigned birth_year to the appropriate data type, print the average birth year of the respondents to the census using the pandas .mean() method.

In [10]:
birth_year_mean = census['birth_year'].mean()
print(birth_year_mean)

1973.4


Your manager would like to set an order to the higher_tax variable so that: strongly disagree < disagree < neutral < agree < strongly agree.

Convert the higher_tax variable to the category data type with the appropriate order, then print the new order using the .unique() method.

In [11]:
census['higher_tax'].unique()

array(['disagree', 'neutral', 'agree', 'strongly agree',
       'strongly disagree'], dtype=object)

In [12]:
census['higher_tax'] = pd.Categorical(census['higher_tax'], 
                    ['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree'], ordered=True)

In [13]:
print(census['higher_tax'].unique())

['disagree', 'neutral', 'agree', 'strongly agree', 'strongly disagree']
Categories (5, object): ['strongly disagree' < 'disagree' < 'neutral' < 'agree' < 'strongly agree']
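
Setting ordered=True is what makes comparisons between categories meaningful. The following sketch, on a small made-up Series rather than the census data, shows what the order enables: filtering above a level and taking the minimum and maximum.

```python
import pandas as pd

levels = ['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree']

# Toy responses (illustrative only), stored as an ordered categorical
opinions = pd.Series(
    pd.Categorical(['disagree', 'agree', 'neutral', 'strongly agree'],
                   categories=levels, ordered=True)
)

# Comparisons follow the category order, not alphabetical order
above_neutral = opinions[opinions > 'neutral'].tolist()
lowest, highest = opinions.min(), opinions.max()

print(above_neutral)    # ['agree', 'strongly agree']
print(lowest, highest)  # disagree strongly agree
```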


Your manager would also like to know the median sentiment of the respondents on the issue of higher taxes for the wealthy. Label encode the higher_tax variable and print the median using the pandas .median() method.

In [14]:
census['higher_tax_codes'] = census['higher_tax'].cat.codes

In [15]:
print(census['higher_tax_codes'].unique())

[1 2 3 4 0]


In [16]:
census.head()

Unnamed: 0,first_name,last_name,birth_year,voted,num_children,income_year,higher_tax,marital_status,higher_tax_codes
0,Denise,Ratke,2005,False,0,92129.41,disagree,single,1
1,Hali,Cummerata,1987,False,0,75649.17,neutral,divorced,2
2,Salomon,Orn,1992,True,2,166313.45,agree,single,3
3,Sarina,Schiller,1965,False,2,71704.81,strongly agree,married,4
4,Gust,Abernathy,1945,False,2,143316.08,agree,married,3


In [17]:
higher_tax_median = census['higher_tax_codes'].median()
print(higher_tax_median)

2.0
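
A median code is easier to report once it is mapped back to its label. With the category order defined earlier, code 2 is 'neutral'. The sketch below shows the lookup on a made-up five-row sample (not the full census), so its median differs from the 2.0 above.

```python
import pandas as pd

levels = ['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree']

# Toy responses (illustrative only)
opinions = pd.Series(
    pd.Categorical(['disagree', 'neutral', 'agree', 'strongly agree', 'agree'],
                   categories=levels, ordered=True)
)

codes = opinions.cat.codes        # 1, 2, 3, 4, 3
median_code = codes.median()      # 3.0 for this sample

# With an even number of rows the median can end in .5; int() then truncates,
# so decide how to handle that case before relying on this lookup
median_label = opinions.cat.categories[int(median_code)]

print(median_code, median_label)  # 3.0 agree
```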



Your manager is interested in using machine learning models on the census data in the future. To help, let’s One-Hot Encode marital_status to create binary variables of each category. Use the pandas get_dummies() method to One-Hot Encode the marital_status variable.

Print the first five rows of the new dataframe with the .head() method. Note that you’ll have to scroll to the right or expand the web-browser to see the dummy variables.

In [None]:
# Store the result under a new name: later cells still use census['marital_status']
census_encoded = pd.get_dummies(data=census, columns=['marital_status'])

In [19]:
census.head()

Unnamed: 0,first_name,last_name,birth_year,voted,num_children,income_year,higher_tax,marital_status,higher_tax_codes
0,Denise,Ratke,2005,False,0,92129.41,disagree,single,1
1,Hali,Cummerata,1987,False,0,75649.17,neutral,divorced,2
2,Salomon,Orn,1992,True,2,166313.45,agree,single,3
3,Sarina,Schiller,1965,False,2,71704.81,strongly agree,married,4
4,Gust,Abernathy,1945,False,2,143316.08,agree,married,3
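
To see what get_dummies() produces, here is a minimal sketch on a made-up three-row DataFrame (the names toy and encoded are assumptions for illustration). The original column is dropped and replaced by one indicator column per category.

```python
import pandas as pd

# Toy frame (illustrative only)
toy = pd.DataFrame({
    'first_name': ['Denise', 'Hali', 'Sarina'],
    'marital_status': ['single', 'divorced', 'married'],
})

# One indicator column per category; the source column is removed
encoded = pd.get_dummies(data=toy, columns=['marital_status'])

dummy_columns = [c for c in encoded.columns if c.startswith('marital_status_')]
print(dummy_columns)
# ['marital_status_divorced', 'marital_status_married', 'marital_status_single']
```

Each row has exactly one indicator set. Because the original marital_status column no longer exists in the result, any later step that needs it should work from the un-encoded DataFrame.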


Congratulations! You have used your variable skills to help the census team with managing their data. Feel free to explore the data further. There are additional operations you can perform on the data, such as:

Create a new variable called marital_codes by Label Encoding the marital_status variable. This could help the Census team use machine learning to predict if a respondent thinks the wealthy should pay higher taxes based on their marital status.

In [21]:
print(census['marital_status'].unique())

['single' 'divorced' 'married' 'widowed']


In [22]:
census['marital_status'] = pd.Categorical(census['marital_status'], ['single', 'divorced', 'married', 'widowed'])

In [23]:
census['marital_status_codes'] = census['marital_status'].cat.codes

In [24]:
print(census['marital_status_codes'].unique())

[0 1 2 3]


In [28]:
marital_status_median = census['marital_status_codes'].median()
print(marital_status_median)

1.0


In [27]:
census.head()

Unnamed: 0,first_name,last_name,birth_year,voted,num_children,income_year,higher_tax,marital_status,higher_tax_codes,marital_status_codes
0,Denise,Ratke,2005,False,0,92129.41,disagree,single,1,0
1,Hali,Cummerata,1987,False,0,75649.17,neutral,divorced,2,1
2,Salomon,Orn,1992,True,2,166313.45,agree,single,3,0
3,Sarina,Schiller,1965,False,2,71704.81,strongly agree,married,4,2
4,Gust,Abernathy,1945,False,2,143316.08,agree,married,3,2


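The median code is 1.0, which corresponds to 'divorced' in the category order set above. Because marital_status is a nominal variable, that order is arbitrary, so the median of its codes is not a reliable summary; the mode (the most frequent value) is the better measure of the most typical marital status.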
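
Since marital_status has no natural order, the most typical value can be checked with a frequency count rather than a median. A minimal sketch on made-up values (not the census data):

```python
import pandas as pd

# Toy values (illustrative only)
status = pd.Series(['single', 'divorced', 'single', 'married', 'married', 'single'])

# value_counts() gives the frequency of each category
counts = status.value_counts()

# mode() returns the most frequent value(s); take the first one
most_common = status.mode()[0]

print(most_common)            # single
print(int(counts['single']))  # 3
```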

Create a new variable called age_group, which groups respondents based on their birth year. The groups should be in five-year increments, e.g., 26-30, 31-35, etc. Then label encode the age_group variable to assist the Census team in the event they would like to use machine learning to predict if a respondent thinks the wealthy should pay higher taxes based on their age group.
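
One possible approach to this exercise uses pd.cut. The sketch below runs on a made-up Series; the reference year used to compute age and the bin edges are assumptions, not part of the census data.

```python
import pandas as pd

reference_year = 2024  # assumption: the year used to turn birth year into age

# Toy birth years (illustrative only)
birth_year = pd.Series([2005, 1987, 1992, 1965, 1945])
age = reference_year - birth_year  # 19, 37, 32, 59, 79

# Five-year bins: (0, 5], (5, 10], ..., (100, 105]
bin_edges = list(range(0, 106, 5))
labels = [f'{lo + 1}-{lo + 5}' for lo in bin_edges[:-1]]  # '1-5', '6-10', ...

# pd.cut returns an ordered categorical; the right edge of each bin is inclusive
age_group = pd.cut(age, bins=bin_edges, labels=labels)

# Label encode: each group becomes the integer position of its bin
age_group_codes = age_group.cat.codes

print(age_group.tolist())        # ['16-20', '36-40', '31-35', '56-60', '76-80']
print(age_group_codes.tolist())  # [3, 7, 6, 11, 15]
```

With these edges an age of exactly 0 falls outside the first bin and becomes NaN (code -1), so the edges may need adjusting for other data.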