# Census Variables
We have decided to volunteer for our local community by offering to clean their recently collected census data. 
The census dataframe is composed of simulated census data to represent demographics of a small community in the U.S. The description of this dataset is as follows:
- **first_name:**	The respondent’s first name.
- **last_name:**	The respondent’s last name.
- **birth_year:**	The respondent’s year of birth.
- **voted:**	If the respondent participated in the current voting cycle.
- **num_children:**	The number of children the respondent has.
- **income_year:**	The average yearly income the respondent earns.
- **higher_tax:**	The respondent’s answer to the question: “Rate your agreement with the statement: the wealthy should pay higher taxes.”
- **marital_status:**	The respondent’s current marital status.
 

In [66]:
#Import libraries
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)

In [67]:
# Read in the census dataframe
census = pd.read_csv('census_data.csv', index_col=0)
# Print the first 5 rows of the DataFrame
print(census.head())

  first_name  last_name birth_year  voted  num_children  income_year  \
0     Denise      Ratke       2005  False             0     92129.41   
1       Hali  Cummerata       1987  False             0     75649.17   
2    Salomon        Orn       1992   True             2    166313.45   
3     Sarina   Schiller       1965  False             2     71704.81   
4       Gust  Abernathy       1945  False             2    143316.08   

       higher_tax marital_status  
0        disagree         single  
1         neutral       divorced  
2           agree         single  
3  strongly agree        married  
4           agree        married  


In [68]:
#Checking the data types of every column
print(census.dtypes)

first_name         object
last_name          object
birth_year         object
voted                bool
num_children        int64
income_year       float64
higher_tax         object
marital_status     object
dtype: object


## Inspecting Data Types
The manager of the census would like to know the average birth year of the respondents. The .dtypes output shows that birth_year is stored with the object dtype (its values are strings), whereas it should be an int. Let's look at the unique values of the column using the .unique() method.

In [69]:
#Checking the unique values of the birth year column
print(census['birth_year'].unique())

['2005' '1987' '1992' '1965' '1945' '1951' '1963' '1949' '1950' '1971'
 '2007' '1944' '1995' '1973' '1946' '1954' '1994' '1989' '1947' '1993'
 '1976' '1984' 'missing' '1966' '1941' '2000' '1953' '1956' '1960' '2001'
 '1980' '1955' '1985' '1996' '1968' '1979' '2006' '1962' '1981' '1959'
 '1977' '1978' '1983' '1957' '1961' '1982' '2002' '1998' '1999' '1952'
 '1940' '1986' '1958']
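
With many unique values, a placeholder such as 'missing' can be hard to spot by eye. As a minimal sketch (the small Series below is an illustrative stand-in, not the real census column), pd.to_numeric with errors='coerce' can surface the entries that are not numbers:

```python
import pandas as pd

# Toy stand-in for census['birth_year']; the values are illustrative only.
birth_year = pd.Series(['2005', '1987', 'missing', '1966'])

# errors='coerce' turns anything that cannot be parsed as a number into NaN.
as_numbers = pd.to_numeric(birth_year, errors='coerce')

# Keep the original strings wherever parsing failed.
non_numeric = birth_year[as_numbers.isna()]
print(non_numeric.tolist())
```

On the real data, the same filter applied to census['birth_year'] would list every non-numeric entry, not just the first one noticed.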


## Altering Data
There appears to be a missing value in the birth_year column, recorded as the string 'missing'. With some research, we found that the respondent’s birth year is 1967. Let's replace the missing value with this number and change the column to an int data type.

In [70]:
# Replace the 'missing' placeholder with the researched birth year
census['birth_year'] = census['birth_year'].replace(['missing'], 1967)
print(census['birth_year'].unique())
# Change the birth_year type to int
census['birth_year'] = census['birth_year'].astype('int')
print(census['birth_year'].dtypes)

['2005' '1987' '1992' '1965' '1945' '1951' '1963' '1949' '1950' '1971'
 '2007' '1944' '1995' '1973' '1946' '1954' '1994' '1989' '1947' '1993'
 '1976' '1984' 1967 '1966' '1941' '2000' '1953' '1956' '1960' '2001'
 '1980' '1955' '1985' '1996' '1968' '1979' '2006' '1962' '1981' '1959'
 '1977' '1978' '1983' '1957' '1961' '1982' '2002' '1998' '1999' '1952'
 '1940' '1986' '1958']
int64
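
The unique values printed above briefly mix strings with the integer 1967. An alternative sketch, assuming the same 'missing' placeholder and using a toy Series rather than the real column, replaces the placeholder with a string and lets pd.to_numeric do the conversion in one step:

```python
import pandas as pd

# Toy stand-in for census['birth_year']; the values are illustrative only.
birth_year = pd.Series(['2005', 'missing', '1966'])

# Replace the placeholder with a string so every value has the same type,
# then parse the whole column as numbers.
converted = pd.to_numeric(birth_year.replace('missing', '1967'))

print(converted.tolist())
print(pd.api.types.is_integer_dtype(converted))
```

Because to_numeric raises an error on any value it cannot parse, this version would also flag a second, unexpected placeholder instead of silently keeping it.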


Our manager wants to know the average birth year of the census respondents.

In [71]:
print(census['birth_year'].mean())

1973.4



## Ordering High Tax Values
Our manager would like to set an order on the higher_tax variable so that: strongly disagree < disagree < neutral < agree < strongly agree. Let's convert the higher_tax variable to the category data type with the appropriate order, then print the new order using the .unique() method.

In [72]:
print(census['higher_tax'].unique())

['disagree' 'neutral' 'agree' 'strongly agree' 'strongly disagree']


In [73]:
census['higher_tax'] = pd.Categorical(census['higher_tax'], ['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree'], ordered=True)
print(census['higher_tax'].unique())

['disagree', 'neutral', 'agree', 'strongly agree', 'strongly disagree']
Categories (5, object): ['strongly disagree' < 'disagree' < 'neutral' < 'agree' < 'strongly agree']
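
Once the order is declared, comparisons and min/max follow it rather than alphabetical order. The sketch below uses a small made-up Series (not the census data) to illustrate this:

```python
import pandas as pd

levels = ['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree']

# Toy answers; illustrative values only.
answers = pd.Series(
    pd.Categorical(
        ['disagree', 'neutral', 'agree', 'strongly agree', 'agree'],
        categories=levels,
        ordered=True,
    )
)

# Comparisons use the declared category order, not string order.
at_least_agree = answers >= 'agree'
print(at_least_agree.tolist())

# min() and max() are only defined because the categorical is ordered.
print(answers.min(), answers.max())
```

Alphabetically, 'disagree' would sort after 'agree'; with the ordered categorical it correctly compares as lower.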


In [74]:
census['higher_tax_codes'] = census['higher_tax'].cat.codes
print(census.head())

  first_name  last_name  birth_year  voted  num_children  income_year  \
0     Denise      Ratke        2005  False             0     92129.41   
1       Hali  Cummerata        1987  False             0     75649.17   
2    Salomon        Orn        1992   True             2    166313.45   
3     Sarina   Schiller        1965  False             2     71704.81   
4       Gust  Abernathy        1945  False             2    143316.08   

       higher_tax marital_status  higher_tax_codes  
0        disagree         single                 1  
1         neutral       divorced                 2  
2           agree         single                 3  
3  strongly agree        married                 4  
4           agree        married                 3  


## Checking median of High Tax Values
Our manager would also like to know the median sentiment of the respondents on the issue of higher taxes for the wealthy. Let's label encode the higher_tax variable with .cat.codes, compute the median of the codes, and look up the category that the median code represents.

In [75]:
median_index = np.median(census['higher_tax_codes'])
print(median_index)

2.0


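In [76]:
# Look up the category whose code equals the median.
# Indexing the column itself with the code would return the value in row 2 instead.
median_index_status = census['higher_tax'].cat.categories[int(median_index)]
print(median_index_status)

neutral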
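
A median code is a position in the category list, not a row label, so it has to be looked up in .cat.categories. The following is a minimal sketch on made-up answers (not the census data):

```python
import numpy as np
import pandas as pd

levels = ['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree']

# Toy answers; illustrative values only.
answers = pd.Series(
    pd.Categorical(
        ['disagree', 'neutral', 'agree', 'strongly agree', 'agree'],
        categories=levels,
        ordered=True,
    )
)

# Codes are positions in `levels`: here 1, 2, 3, 4, 3.
codes = answers.cat.codes
median_code = np.median(codes)

# Translate the median code back into its category label.
median_label = answers.cat.categories[int(median_code)]
print(median_code, median_label)
```

With an even number of respondents the median can fall halfway between two codes (for example 2.5); int() then truncates toward the lower category, so that case deserves an explicit decision.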


## One-Hot-Encoding Marital Status
Let's create a new variable called marital_codes by Label Encoding the marital_status variable. This could help the Census team use machine learning to predict if a respondent thinks the wealthy should pay higher taxes based on their marital status.

In [77]:
print(census.marital_status.unique())

['single' 'divorced' 'married' 'widowed']


In [78]:
census.marital_status = pd.Categorical(census.marital_status, ['single', 'married', 'divorced', 'widowed'], ordered=True)
print(census.marital_status.unique())

['single', 'divorced', 'married', 'widowed']
Categories (4, object): ['single' < 'married' < 'divorced' < 'widowed']


In [79]:
census['marital_codes'] = census.marital_status.cat.codes
print(census.head())

  first_name  last_name  birth_year  voted  num_children  income_year  \
0     Denise      Ratke        2005  False             0     92129.41   
1       Hali  Cummerata        1987  False             0     75649.17   
2    Salomon        Orn        1992   True             2    166313.45   
3     Sarina   Schiller        1965  False             2     71704.81   
4       Gust  Abernathy        1945  False             2    143316.08   

       higher_tax marital_status  higher_tax_codes  marital_codes  
0        disagree         single                 1              0  
1         neutral       divorced                 2              2  
2           agree         single                 3              0  
3  strongly agree        married                 4              1  
4           agree        married                 3              1  


In [80]:
marital_codes_median = np.median(census.marital_codes)
print(marital_codes_median)

1.0


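In [81]:
# Look up the category whose code equals the median.
# Indexing the column itself with the code would return the value in row 1 instead.
marital_codes_median_status = census.marital_status.cat.categories[int(marital_codes_median)]
print(marital_codes_median_status)

married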
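
Marital status has no natural ranking, so a median of its codes depends on whatever category order was chosen. A summary that does not depend on that order is the mode. The sketch below uses a made-up Series rather than the census data:

```python
import pandas as pd

# Toy marital statuses; illustrative values only.
status = pd.Series(
    ['single', 'divorced', 'single', 'married', 'married', 'married', 'widowed']
)

# mode() returns the most frequent value(s); ties would yield several entries.
most_common = status.mode()
print(most_common.tolist())

# value_counts() shows how often each status occurs.
counts = status.value_counts()
print(counts.to_dict())
```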


Let's one-hot encode marital_status to create a binary variable for each category.

In [82]:
census = pd.get_dummies(data=census, columns=['marital_status'])
print(census.head())

  first_name  last_name  birth_year  voted  num_children  income_year  \
0     Denise      Ratke        2005  False             0     92129.41   
1       Hali  Cummerata        1987  False             0     75649.17   
2    Salomon        Orn        1992   True             2    166313.45   
3     Sarina   Schiller        1965  False             2     71704.81   
4       Gust  Abernathy        1945  False             2    143316.08   

       higher_tax  higher_tax_codes  marital_codes  marital_status_single  \
0        disagree                 1              0                      1   
1         neutral                 2              2                      0   
2           agree                 3              0                      1   
3  strongly agree                 4              1                      0   
4           agree                 3              1                      0   

   marital_status_married  marital_status_divorced  marital_status_widowed  
0                    
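
As a small self-contained illustration of one-hot encoding (the three-row frame below is a toy example, not the full census), pd.get_dummies creates one indicator column per category. Recent pandas versions return booleans by default, so dtype=int is passed here to get 0/1 values like the ones shown above:

```python
import pandas as pd

# Toy frame; illustrative rows only.
people = pd.DataFrame({
    'first_name': ['Denise', 'Hali', 'Sarina'],
    'marital_status': ['single', 'divorced', 'married'],
})

# One indicator column per category; the original column is dropped.
encoded = pd.get_dummies(people, columns=['marital_status'], dtype=int)

print(sorted(encoded.columns))
print(encoded['marital_status_single'].tolist())
```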

## Age group categories 
Let's create a new variable called age_group, which groups respondents based on their birth year. The groups should be in five-year increments, e.g., 25-30, 31-35, etc. Then label encode the age_group variable to assist the Census team in the event they would like to use machine learning to predict if a respondent thinks the wealthy should pay higher taxes based on their age group.

In [83]:
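# Approximate age, using 2022 as the reference year
census['age'] = 2022 - census['birth_year']

In [84]:
# Bin edges in 5-year steps, starting 5 below the youngest age and stopping before 100
age_bins = np.arange(min(census['age']) - 5, 100, 5)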

In [85]:
census['age_group'] = pd.cut(census['age'], bins=age_bins)
print(census.head())

  first_name  last_name  birth_year  voted  num_children  income_year  \
0     Denise      Ratke        2005  False             0     92129.41   
1       Hali  Cummerata        1987  False             0     75649.17   
2    Salomon        Orn        1992   True             2    166313.45   
3     Sarina   Schiller        1965  False             2     71704.81   
4       Gust  Abernathy        1945  False             2    143316.08   

       higher_tax  higher_tax_codes  marital_codes  marital_status_single  \
0        disagree                 1              0                      1   
1         neutral                 2              2                      0   
2           agree                 3              0                      1   
3  strongly agree                 4              1                      0   
4           agree                 3              1                      0   

   marital_status_married  marital_status_divorced  marital_status_widowed  \
0                   
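
The binning and the label-encoding step described in this section can be sketched on the five birth years shown in the preview. The reference year 2022 and the bin edges starting at 10 are assumptions made for this illustration:

```python
import numpy as np
import pandas as pd

# Birth years of the first five respondents shown above.
birth_year = pd.Series([2005, 1987, 1992, 1965, 1945])
age = 2022 - birth_year  # 2022 is an assumed reference year

# Edges 10, 15, ..., 95; pd.cut builds right-closed intervals such as (25, 30].
age_bins = np.arange(10, 100, 5)
age_group = pd.cut(age, bins=age_bins)

# Label encode: each interval gets its position in the ordered list of bins.
age_group_codes = age_group.cat.codes

print(age.tolist())
print(age_group_codes.tolist())
```

Because the intervals are closed on the right, an integer age of 30 falls in (25, 30] and 31 falls in (30, 35], which matches the "25-30, 31-35" grouping. Any age outside the bin range would become NaN and receive the code -1.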