## BourbonBaked: Data Enrichment

A new company, BourbonBaked, is launching a health product. Through beta testing they were able to collect user demographic information, but they want as much information as they can get about the income levels and general health of these individuals so they can build the best possible lookalike audiences for their national product launch next month. They have tasked us with cleaning up their user file and enriching it with additional data to support those lookalike audiences.

---

In [20]:
#importing packages needed
import requests
import pandas as pd
import json
import creds

### Accessing User Data File 

Because this is a project about showing the possibilities as opposed to a real company, our user data is generated from an API instead of a file export from a CRM. For this reason, the next few steps go through a little cleanup to get us a usable data set.

In [21]:
#accessing the API to get user data
r = requests.get('https://peoplegeneratorapi.live/api/person/1000')
print(r.status_code)
print(r.json())

200
[{'name': 'Seymour Mosciski', 'age': 4, 'job': 'Student', 'incomeUSD': 0, 'creditScore': 0, 'ccNumber': None, 'married': False, 'hasChildren': False, 'height': 160.0, 'weight': 54.5, 'eyeColor': 'GRAY', 'email': 'mosciski@gmail.com', 'gender': 'Female', 'hasDegree': False, 'bloodType': 'B+', 'username': 'seymour38', 'politicalLeaning': -0.31, 'religion': 'Christianity', 'address': {'streetAddress': '993 Basil Knolls', 'city': 'Manjo', 'state': 'West', 'country': 'Cameroon', 'zip': '8379', 'geonameId': 2228373, 'phoneNumber': '+237 682959158', 'ipAddress': '147.60.212.143', 'countryCode': 'CM'}, 'doB': 'Wed Nov 27 10:41:47 UTC 2019', 'gpa': 2.6}, {'name': 'Emil Blanda', 'age': 1, 'job': 'Student', 'incomeUSD': 0, 'creditScore': 0, 'ccNumber': None, 'married': False, 'hasChildren': False, 'height': 175.0, 'weight': 75.5, 'eyeColor': 'BROWN', 'email': 'emil_blanda@gmail.com', 'gender': 'Other', 'hasDegree': False, 'bloodType': 'A+', 'username': 'emil77', 'politicalLeaning': -0.58, 're

In [22]:
#viewing the data as a dataframe
data=r.json()
df = (pd.DataFrame(data))
df.head()

Unnamed: 0,name,age,job,incomeUSD,creditScore,ccNumber,married,hasChildren,height,weight,...,email,gender,hasDegree,bloodType,username,politicalLeaning,religion,address,doB,gpa
0,Seymour Mosciski,4,Student,0,0,,False,False,160.0,54.5,...,mosciski@gmail.com,Female,False,B+,seymour38,-0.31,Christianity,"{'streetAddress': '993 Basil Knolls', 'city': ...",Wed Nov 27 10:41:47 UTC 2019,2.6
1,Emil Blanda,1,Student,0,0,,False,False,175.0,75.5,...,emil_blanda@gmail.com,Other,False,A+,emil77,-0.58,Islam,"{'streetAddress': '7329 Witting Spur', 'city':...",Fri Mar 17 21:26:09 UTC 2023,3.1
2,Neal Dooley,33,Miner,57994,805,6553-4415-1720-1590,False,True,161.0,60.5,...,dooley@gmail.com,Male,False,A-,neal91,-0.83,Hinduism,"{'streetAddress': '00154 Bergstrom Lakes', 'ci...",Sat Nov 30 20:57:23 UTC 1991,2.5
3,Frankie Walter,13,Student,0,0,,False,False,145.0,55.7,...,frankie@gmail.com,Female,False,A-,frankie90,-0.22,Islam,"{'streetAddress': '463 Gretta Haven', 'city': ...",Wed Aug 04 13:04:50 UTC 2010,2.9
4,Zackary Waters,10,Student,0,0,,False,False,166.0,80.9,...,waters@gmail.com,Male,False,B+,zackary87,-0.48,Islam,"{'streetAddress': '2631 Michiko Courts', 'city...",Sat May 10 05:52:23 UTC 2014,0.0


This is great, but the address is still coming through as nested JSON, so let's unpack that.

In [34]:
#normalizing the address data
df = pd.json_normalize(data)

#getting a list of all the column names
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 29 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   name                   1000 non-null   object 
 1   age                    1000 non-null   int64  
 2   job                    1000 non-null   object 
 3   incomeUSD              1000 non-null   int64  
 4   creditScore            1000 non-null   int64  
 5   ccNumber               690 non-null    object 
 6   married                1000 non-null   bool   
 7   hasChildren            1000 non-null   bool   
 8   height                 1000 non-null   float64
 9   weight                 1000 non-null   float64
 10  eyeColor               1000 non-null   object 
 11  email                  1000 non-null   object 
 12  gender                 1000 non-null   object 
 13  hasDegree              1000 non-null   bool   
 14  bloodType              1000 non-null   object 
 15  usern
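
To make the effect of flattening concrete, here is a minimal sketch on a single hand-written record. The `sample` record is illustrative only (it is shaped like the API output above, but it is not real API data).

```python
import pandas as pd

# Illustrative record shaped like one entry of the API response (not real data)
sample = [
    {
        "name": "Example User",
        "age": 33,
        "address": {"city": "Ferndale", "state": "Arkansas", "country": "United States"},
    }
]

nested = pd.DataFrame(sample)      # 'address' stays a single dict-valued column
flat = pd.json_normalize(sample)   # nested keys become 'address.<key>' columns

print(list(nested.columns))
print(list(flat.columns))
```

The dotted column names are what allow later cells to filter on `address.country` and `address.state` directly.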

Since we are only going to market this product to adults, we need to remove all children from the dataset.

In [35]:
#removing individuals under the age of 18 
df.drop(df[df['age'] < 18].index, inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 690 entries, 2 to 999
Data columns (total 29 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   name                   690 non-null    object 
 1   age                    690 non-null    int64  
 2   job                    690 non-null    object 
 3   incomeUSD              690 non-null    int64  
 4   creditScore            690 non-null    int64  
 5   ccNumber               690 non-null    object 
 6   married                690 non-null    bool   
 7   hasChildren            690 non-null    bool   
 8   height                 690 non-null    float64
 9   weight                 690 non-null    float64
 10  eyeColor               690 non-null    object 
 11  email                  690 non-null    object 
 12  gender                 690 non-null    object 
 13  hasDegree              690 non-null    bool   
 14  bloodType              690 non-null    object 
 15  userna

As we can see the file is now smaller and excludes anyone under the age of 18. 

Additionally, we only want a file of United States residents since our mock company is US based and only marketing to those areas.  

In [36]:
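#removing non USA based users
#.copy() makes the result an independent dataframe, so adding columns later does not raise SettingWithCopyWarning
df = df[df['address.country'] == 'United States'].copy()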

df.head()

Unnamed: 0,name,age,job,incomeUSD,creditScore,ccNumber,married,hasChildren,height,weight,...,gpa,address.streetAddress,address.city,address.state,address.country,address.zip,address.geonameId,address.phoneNumber,address.ipAddress,address.countryCode
73,Lynne Waelchi,54,Biochemist,85195,572,6759-8200-5173-9969-802,True,True,151.0,64.2,...,3.4,2546 Dung Gardens,Ferndale,Arkansas,United States,63637-7263,4509177,+1 65878161495,130.75.166.4,US
184,Phillip Oberbrunner,50,Employee Relations Manager,99924,708,6007-2212-4198-9955,False,True,165.0,64.4,...,1.9,534 Columbus Plains,Abington,Florida,United States,04536,5358736,+1 34867760121,228.72.55.130,US
231,Delois Koelpin,23,Personal Assistant,54950,572,6759-7038-7890-2819-469,False,False,150.0,59.1,...,2.4,418 Chet Via,Sunnyside,Georgia,United States,48010-3294,4354256,+1 46348544022,21.27.122.37,US
263,Alisha Kris,28,Firefighter,79088,535,4870621957929,True,False,161.0,64.5,...,2.4,1589 Ratke Shoal,Temecula,New Jersey,United States,43695,2204582,+679 0316820,26.90.112.62,FJ
484,Randall Herzog,28,Veterinary Technician,40140,460,6771338844834059,True,False,150.0,44.7,...,3.0,0347 Hayes Dale,Sherman Oaks,New York,United States,77604-0830,5265499,+1 83067073161,229.62.195.18,US
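
As an aside, both filters (adults only, US only) can also be written as boolean masks combined in one step. The sketch below uses a tiny made-up frame; only the column names `age` and `address.country` come from the notebook.

```python
import pandas as pd

# Illustrative rows only; column names match the flattened dataframe above
people = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "age": [4, 33, 54, 17],
        "address.country": ["Cameroon", "United States", "Cameroon", "United States"],
    }
)

is_adult = people["age"] >= 18
in_us = people["address.country"] == "United States"

# .copy() returns an independent frame, so new columns can be added safely later
us_adults = people[is_adult & in_us].copy()
print(us_adults)
```

Only row "B" passes both conditions, which mirrors what the two separate filtering cells do in sequence.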


Great! Now that we have a complete initial user file, it's time to enrich it.

---

### Enriching the Data

We know BourbonBaked is interested in the vital statistics of their users since this is a weight loss product. They are interested in knowing if their beta users fell into any significant category. Since we have height and weight already, we can use this information to get their BMI. 

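The dataframe currently stores Height in CM and Weight in KG, and we will add imperial versions of both columns as well. Note that the `calculate_bmi` function below passes the original metric `height` and `weight` columns to the API, and the BMI values it returns are consistent with metric input (for example, 64.2 kg at 151 cm gives 28.16), so the converted columns serve as a reference rather than as API inputs.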

- **Height**: Convert from CM to In
- **Weight**: Convert from KG to LB

To convert from CM to IN, divide the Height column values by 2.54.

To convert from KG to LB, multiply the Weight column values by 2.205 (a rounded form of 2.20462).
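
A quick worked example of the two conversions, using the first user's measurements (151.0 cm, 64.2 kg). The helper names are hypothetical and exist only for this illustration.

```python
CM_PER_INCH = 2.54   # exact by definition
LB_PER_KG = 2.205    # rounded factor used in this notebook

def cm_to_inches(cm):
    """Convert centimeters to inches."""
    return cm / CM_PER_INCH

def kg_to_pounds(kg):
    """Convert kilograms to pounds using the rounded factor."""
    return kg * LB_PER_KG

print(round(cm_to_inches(151.0), 6))   # 59.448819
print(round(kg_to_pounds(64.2), 4))    # 141.561
```

These match the `height_inches` and `weight_pounds` values shown for that user in the check further down.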

In [37]:
#adding height in inches and weight in pounds
df['height_inches'] = df['height'] / 2.54

df['weight_pounds'] = df['weight'] * 2.205

In [38]:
#generating just the columns needed to double check
bmi_columns_needed = df[['age','height','weight','height_inches','weight_pounds']]
print(bmi_columns_needed)

     age  height  weight  height_inches  weight_pounds
73    54   151.0    64.2      59.448819       141.5610
184   50   165.0    64.4      64.960630       142.0020
231   23   150.0    59.1      59.055118       130.3155
263   28   161.0    64.5      63.385827       142.2225
484   28   150.0    44.7      59.055118        98.5635
574   40   165.0    59.2      64.960630       130.5360
694   38   170.0    70.0      66.929134       154.3500
735   36   160.0    69.8      62.992126       153.9090
763   33   176.0    59.2      69.291339       130.5360
937   34   155.0    70.2      61.023622       154.7910
993   41   150.0    45.5      59.055118       100.3275


#### Appending in BMI Data 

In [39]:
#BMI function for API
def calculate_bmi(row):
    url = "https://fitness-calculator.p.rapidapi.com/bmi"
    querystring = {
        "age": row['age'],
        "weight": row['weight'],
        "height": row['height']
    }
    headers = {
        "X-RapidAPI-Key": creds.api_key,
        "X-RapidAPI-Host": "fitness-calculator.p.rapidapi.com"
    }
    # timeout keeps a single slow request from stalling the whole apply()
    response = requests.get(url, headers=headers, params=querystring, timeout=10)
    data = response.json()
    
    # Check if 'data' key is present in the API response
    if 'data' in data:
        bmi = data['data'].get('bmi', None)
        health = data['data'].get('health', None)
        healthy_bmi_range = data['data'].get('healthy_bmi_range', None)
        return bmi, health, healthy_bmi_range
    else:
        return None, None, None  

#add new columns to the DataFrame
df['bmi'], df['health'], df['healthy_bmi_range'] = zip(*df.apply(calculate_bmi, axis=1))

# Display the updated DataFrame
df.head()

Unnamed: 0,name,age,job,incomeUSD,creditScore,ccNumber,married,hasChildren,height,weight,...,address.zip,address.geonameId,address.phoneNumber,address.ipAddress,address.countryCode,height_inches,weight_pounds,bmi,health,healthy_bmi_range
73,Lynne Waelchi,54,Biochemist,85195,572,6759-8200-5173-9969-802,True,True,151.0,64.2,...,63637-7263,4509177,+1 65878161495,130.75.166.4,US,59.448819,141.561,28.16,Overweight,18.5 - 25
184,Phillip Oberbrunner,50,Employee Relations Manager,99924,708,6007-2212-4198-9955,False,True,165.0,64.4,...,04536,5358736,+1 34867760121,228.72.55.130,US,64.96063,142.002,23.65,Normal,18.5 - 25
231,Delois Koelpin,23,Personal Assistant,54950,572,6759-7038-7890-2819-469,False,False,150.0,59.1,...,48010-3294,4354256,+1 46348544022,21.27.122.37,US,59.055118,130.3155,26.27,Overweight,18.5 - 25
263,Alisha Kris,28,Firefighter,79088,535,4870621957929,True,False,161.0,64.5,...,43695,2204582,+679 0316820,26.90.112.62,FJ,63.385827,142.2225,24.88,Normal,18.5 - 25
484,Randall Herzog,28,Veterinary Technician,40140,460,6771338844834059,True,False,150.0,44.7,...,77604-0830,5265499,+1 83067073161,229.62.195.18,US,59.055118,98.5635,19.87,Normal,18.5 - 25
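
As a sanity check on the API results, BMI can also be computed locally from the standard formula (weight in kilograms divided by height in meters squared). This is a minimal sketch, not the API's implementation; the category cutoffs are assumed WHO-style boundaries and may not match the API's exactly.

```python
def bmi_metric(weight_kg, height_cm):
    """BMI = weight in kilograms divided by the square of height in meters."""
    height_m = height_cm / 100
    return round(weight_kg / height_m ** 2, 2)

def bmi_category(bmi):
    # Assumed WHO-style cutoffs; the API's own boundaries may differ
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal"
    if bmi < 30:
        return "Overweight"
    return "Obese"

# (height cm, weight kg) taken from the first three rows shown above
for height_cm, weight_kg in [(151.0, 64.2), (165.0, 64.4), (150.0, 59.1)]:
    value = bmi_metric(weight_kg, height_cm)
    print(value, bmi_category(value))
```

The three values (28.16, 23.65, 26.27) agree with the `bmi` column returned by the API for the same users, which suggests the API is interpreting the inputs as centimeters and kilograms.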


Now that we have added BMI, the health indicator, and the healthy BMI range, I would like to make the data set more robust to make insights easier. BourbonBaked also mentioned wanting to know income levels to see if they're hitting their anticipated target demographic.

First, I will take the income column and add a new column that labels each user as low, middle, or upper income.

In [20]:
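def categorize_income_level(incomeUSD):
    """Label an annual income (USD) as low, middle, or upper income."""
    income = int(incomeUSD)

    if income < 52200:
        return 'Low income'
    elif income < 156600:
        # the lower bound is already guaranteed by the branch above
        return 'Middle income'
    else:
        return 'Upper income'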

# Add a new column 'income level' to the DataFrame with the income level categories
df['income level'] = df['incomeUSD'].apply(categorize_income_level)

df.head()


Unnamed: 0,name,age,job,incomeUSD,creditScore,ccNumber,married,hasChildren,height,weight,...,address.geonameId,address.phoneNumber,address.ipAddress,address.countryCode,height_inches,weight_pounds,bmi,health,healthy_bmi_range,income level
456,Wilton Rosenbaum,25,Diplomat,181034,590,6771-8977-7245-1798,False,False,156.0,45.6,...,4130430,+1 79089259450,157.126.46.66,US,61.417323,100.548,18.74,Normal,18.5 - 25,Upper income
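
The same bucketing can be done without a row-by-row function by using `pd.cut`. This is an alternative sketch, not what the notebook runs; the thresholds are the ones from the function above and the sample incomes are illustrative.

```python
import numpy as np
import pandas as pd

incomes = pd.Series([40140, 52200, 99924, 156600, 181034])

income_level = pd.cut(
    incomes,
    bins=[-np.inf, 52200, 156600, np.inf],
    labels=["Low income", "Middle income", "Upper income"],
    right=False,  # intervals are [lower, upper), matching the `<` comparisons
)
print(income_level.tolist())
```

With `right=False`, an income of exactly 52200 lands in "Middle income" and exactly 156600 lands in "Upper income", the same as in `categorize_income_level`.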


Next, let's add an indicator for the region of the United States where each individual lives.

In [40]:
regions = {
    "Northeast": ["Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont", "New Jersey", "New York", "Pennsylvania"],
    "Midwest": ["Indiana", "Illinois", "Michigan", "Ohio", "Wisconsin", "Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska", "North Dakota", "South Dakota"],
    "South": ["Delaware", "District of Columbia", "Florida", "Georgia", "Maryland", "North Carolina", "South Carolina", "Virginia", "West Virginia", "Alabama", "Kentucky", "Mississippi", "Tennessee", "Arkansas", "Louisiana", "Oklahoma", "Texas"],
    "West": ["Arizona", "Colorado", "Idaho", "New Mexico", "Montana", "Utah", "Nevada", "Wyoming", "Alaska", "California", "Hawaii", "Oregon", "Washington"]}

# Function to map states to regions
def map_state_to_region(state):
    for region, states_list in regions.items():
        if state in states_list:
            return region
    return 'Unknown' 

# Add the 'region' column to the DataFrame using the map function
df['region'] = df['address.state'].map(map_state_to_region)

df.head()

Unnamed: 0,name,age,job,incomeUSD,creditScore,ccNumber,married,hasChildren,height,weight,...,address.geonameId,address.phoneNumber,address.ipAddress,address.countryCode,height_inches,weight_pounds,bmi,health,healthy_bmi_range,region
73,Lynne Waelchi,54,Biochemist,85195,572,6759-8200-5173-9969-802,True,True,151.0,64.2,...,4509177,+1 65878161495,130.75.166.4,US,59.448819,141.561,28.16,Overweight,18.5 - 25,South
184,Phillip Oberbrunner,50,Employee Relations Manager,99924,708,6007-2212-4198-9955,False,True,165.0,64.4,...,5358736,+1 34867760121,228.72.55.130,US,64.96063,142.002,23.65,Normal,18.5 - 25,South
231,Delois Koelpin,23,Personal Assistant,54950,572,6759-7038-7890-2819-469,False,False,150.0,59.1,...,4354256,+1 46348544022,21.27.122.37,US,59.055118,130.3155,26.27,Overweight,18.5 - 25,South
263,Alisha Kris,28,Firefighter,79088,535,4870621957929,True,False,161.0,64.5,...,2204582,+679 0316820,26.90.112.62,FJ,63.385827,142.2225,24.88,Normal,18.5 - 25,Northeast
484,Randall Herzog,28,Veterinary Technician,40140,460,6771338844834059,True,False,150.0,44.7,...,5265499,+1 83067073161,229.62.195.18,US,59.055118,98.5635,19.87,Normal,18.5 - 25,Northeast
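
An alternative to looping over the region lists for every row is to invert the dictionary once into a state-to-region lookup. The sketch below uses a shortened, illustrative `regions_demo` dictionary; `lookup_region` is a hypothetical helper name.

```python
# Shortened, illustrative version of the regions dictionary
regions_demo = {
    "Northeast": ["New Jersey", "New York"],
    "South": ["Arkansas", "Florida", "Georgia"],
}

# Invert {region: [states]} into {state: region} once, up front
state_to_region = {
    state: region
    for region, states in regions_demo.items()
    for state in states
}

def lookup_region(state):
    """Return the region for a state, or 'Unknown' if it is not listed."""
    return state_to_region.get(state, "Unknown")

print(lookup_region("Georgia"))  # South
print(lookup_region("Guam"))     # Unknown
```

With pandas, the same lookup could be applied as `df['address.state'].map(state_to_region).fillna('Unknown')`. Because any misspelled state name silently falls through to "Unknown", counting the "Unknown" rows is a cheap way to catch typos in the lists.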


We'd also like to add some state-specific information for our analysis, so we'll use another API to bring in data for all states.

In [41]:
import requests

url2 = "https://us-states.p.rapidapi.com/basic"

headers = {
	"X-RapidAPI-Key": creds.api_key,
	"X-RapidAPI-Host": "us-states.p.rapidapi.com"
}

response2 = requests.get(url2, headers=headers)

print(response2.json())


[{'name': 'Alabama', 'postal': 'AL', 'capital': {'name': 'Montgomery', 'latitude': '32.377716', 'longitude': '-86.300568'}, 'population': {'density_km': '37', 'total': '5024279', 'density_mi': '95'}}, {'name': 'Alaska', 'postal': 'AK', 'capital': {'name': 'Juneau', 'latitude': '58.301598', 'longitude': '-134.420212'}, 'population': {'density_km': '<1', 'total': '733391', 'density_mi': '1'}}, {'name': 'American Samoa', 'postal': 'AS', 'capital': {'name': 'Pago Pago', 'latitude': '-14.279444', 'longitude': '-170.700556'}, 'population': {'density_km': '279', 'total': '49710', 'density_mi': '721'}}, {'name': 'Arizona', 'postal': 'AZ', 'capital': {'name': 'Phoenix', 'latitude': '33.448143', 'longitude': '-112.096962'}, 'population': {'density_km': '23', 'total': '7151502', 'density_mi': '60'}}, {'name': 'Arkansas', 'postal': 'AR', 'capital': {'name': 'Little Rock', 'latitude': '34.746613', 'longitude': '-92.288986'}, 'population': {'density_km': '22', 'total': '3011524', 'density_mi': '57'}

In [42]:
#viewing the data as a dataframe
data2=response2.json()
df_state = (pd.DataFrame(data2))
df_state.head()

Unnamed: 0,name,postal,capital,population
0,Alabama,AL,"{'name': 'Montgomery', 'latitude': '32.377716'...","{'density_km': '37', 'total': '5024279', 'dens..."
1,Alaska,AK,"{'name': 'Juneau', 'latitude': '58.301598', 'l...","{'density_km': '<1', 'total': '733391', 'densi..."
2,American Samoa,AS,"{'name': 'Pago Pago', 'latitude': '-14.279444'...","{'density_km': '279', 'total': '49710', 'densi..."
3,Arizona,AZ,"{'name': 'Phoenix', 'latitude': '33.448143', '...","{'density_km': '23', 'total': '7151502', 'dens..."
4,Arkansas,AR,"{'name': 'Little Rock', 'latitude': '34.746613...","{'density_km': '22', 'total': '3011524', 'dens..."


In [43]:
df_state = pd.json_normalize(data2)

df_state.head()

Unnamed: 0,name,postal,capital.name,capital.latitude,capital.longitude,population.density_km,population.total,population.density_mi
0,Alabama,AL,Montgomery,32.377716,-86.300568,37,5024279,95
1,Alaska,AK,Juneau,58.301598,-134.420212,<1,733391,1
2,American Samoa,AS,Pago Pago,-14.279444,-170.700556,279,49710,721
3,Arizona,AZ,Phoenix,33.448143,-112.096962,23,7151502,60
4,Arkansas,AR,Little Rock,34.746613,-92.288986,22,3011524,57


In [53]:
df_state.rename(columns={'name': 'state'}, inplace=True)

df_state.head()

Unnamed: 0,state,postal,capital.name,capital.latitude,capital.longitude,population.density_km,population.total,population.density_mi
0,Alabama,AL,Montgomery,32.377716,-86.300568,37,5024279,95
1,Alaska,AK,Juneau,58.301598,-134.420212,<1,733391,1
2,American Samoa,AS,Pago Pago,-14.279444,-170.700556,279,49710,721
3,Arizona,AZ,Phoenix,33.448143,-112.096962,23,7151502,60
4,Arkansas,AR,Little Rock,34.746613,-92.288986,22,3011524,57


I'd like to add each state's total population to our existing dataset so we can compare against it when extrapolating potential new users.

In [61]:
#perform merge based on the state names
merged_df = df.merge(df_state[['state', 'population.total']], left_on='address.state', right_on='state', how='left')

#rename the 'population.total' column 
merged_df.rename(columns={'population.total': 'state_population'}, inplace=True)

print(merged_df)

                     name  age  \
0           Lynne Waelchi   54   
1     Phillip Oberbrunner   50   
2          Delois Koelpin   23   
3             Alisha Kris   28   
4          Randall Herzog   28   
5          Leanne Hilpert   40   
6           Kelsey Jacobi   38   
7        Petronila Walter   36   
8         Anthony Treutel   33   
9        Duane Schowalter   34   
10  Lieselotte Vandervort   41   

                                               job  incomeUSD  creditScore  \
0                                       Biochemist      85195          572   
1                       Employee Relations Manager      99924          708   
2                               Personal Assistant      54950          572   
3                                      Firefighter      79088          535   
4                            Veterinary Technician      40140          460   
5                                       Biochemist      91174          469   
6                       Employee Relations Ma
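
Two things are worth checking around a merge like this: the population values arrive from the API as strings, and a left merge silently leaves gaps for any state name that does not match. The sketch below shows one way to handle both on made-up frames; "Atlantis" is a deliberately unmatched placeholder, and only the column names come from the notebook.

```python
import pandas as pd

users = pd.DataFrame({"name": ["U1", "U2"], "address.state": ["Arkansas", "Atlantis"]})
states = pd.DataFrame(
    {"state": ["Arkansas", "Alaska"], "population.total": ["3011524", "733391"]}
)

# The API returns population as text, so convert before doing arithmetic
states["population.total"] = pd.to_numeric(states["population.total"], errors="coerce")

checked = users.merge(
    states,
    left_on="address.state",
    right_on="state",
    how="left",
    validate="m:1",   # raise if a state appears more than once on the right
    indicator=True,   # adds a '_merge' column describing each row's match
)

unmatched = checked.loc[checked["_merge"] == "left_only", "address.state"].tolist()
print(unmatched)
```

Any state listed in `unmatched` would end up with a missing `state_population` after the merge.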

In [2]:
merged_df.info()


There is one last data point to add: the Per Capita Personal Income (PCPI) for each state. This allows us to see how a user's income compares within their state for state-segmented marketing (if needed).

In [3]:
#forward slash avoids the invalid '\s' escape sequence and works across operating systems
df_state_income = pd.read_csv('data_file/states.csv')
df_state_income.head()


In [None]:
#adding in PCPI (Per Capita Personal Income) to the merged dataframe

merged_df = merged_df.merge(df_state_income, left_on='state', right_on='State', how='left')

merged_df.rename(columns={'PCPI ($)': 'state_per_capita_income'}, inplace=True)

merged_df.info()

In [None]:
#cleaning up the final dataframe 

merged_df.drop(columns=['Rank', 'FIPS Code', 'State', 'state'], inplace=True)

print(merged_df.columns)

In [None]:
#exporting final dataframe as csv for further analysis 

merged_df.to_csv('enriched_customer_file.csv')

That's a wrap on data wrangling! The cleaned final file is now available as a CSV export. In *data_vizualization.ipynb* we'll explore the data set as a whole and break it into segments to see what insights we can garner.