In [1]:
#manipulate dataframes in python
import pandas as pd
import numpy as np

#make API calls with python
import requests

#allows us to store results of API call cleanly
import json

import math
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

from collections import OrderedDict 

In [2]:
api_key = "a45a33bca80b5e1907adc4587be5c346b897dda7"

# Year 2000

Before the ACS period (2005 onward), the Census Bureau published city population figures only for the decennial census years, 1990 and 2000.

For 2000, the Decennial Census Summary File 1 (SF1) dataset should provide the variables we need.

https://api.census.gov/data/2000/sf1/variables.html

First, check how many cities (place FIPS codes) the dataset contains.

For the census population, I can use P001001 --> Population: Total <br>to get the total population of each place and to count how many places are in the dataset.
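
The request pattern used throughout this notebook can be wrapped in a small helper. This is only a sketch: the function name `build_sf1_url` and its parameters are my own (hypothetical) and not part of the Census API, and the key is passed in as an argument instead of being hard-coded.

```python
def build_sf1_url(variables, api_key, year="2000", dsource="sf1"):
    """Build a Census API URL requesting `variables` for every place in every state."""
    # The API expects a comma-separated list with no spaces between names.
    cols = ",".join(variables)
    base_url = f"https://api.census.gov/data/{year}/{dsource}?get={cols}"
    return f"{base_url}&for=place:*&in=state:*&key={api_key}"


# Example with a placeholder key (replace with a real key before calling the API).
url = build_sf1_url(["NAME", "P001001"], api_key="YOUR_KEY")
print(url)
```

The resulting string can be passed to `requests.get` exactly as in the cells below.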

In [3]:
year='2000'
dsource='sf1'
cols = 'NAME,P001001'
base_url = f'https://api.census.gov/data/{year}/{dsource}?get={cols}'
data_url= f'{base_url}&for=place:*&in=state:*&key={api_key}'
response = requests.get(data_url)
data=response.json()
# Count the rows returned by the API (the first row is the header)
print(year, len(data))
    




2000 25151


In [4]:
data[:5]

[['NAME', 'P001001', 'state', 'place'],
 ['Altoona town', '984', '01', '01660'],
 ['Boaz city', '7411', '01', '07912'],
 ['Calera city', '3158', '01', '11416'],
 ['Childersburg city', '4927', '01', '14464']]
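
Note that the first row of the response is the header, so the number of places is `len(data) - 1`. A small sketch using the rows printed above; `split_header` is a hypothetical helper name introduced only for this illustration.

```python
# The first five rows returned by the API, copied from the output above.
data = [
    ['NAME', 'P001001', 'state', 'place'],
    ['Altoona town', '984', '01', '01660'],
    ['Boaz city', '7411', '01', '07912'],
    ['Calera city', '3158', '01', '11416'],
    ['Childersburg city', '4927', '01', '14464'],
]


def split_header(rows):
    """Separate the header row from the data rows of a Census API response."""
    return rows[0], rows[1:]


header, records = split_header(data)
print(len(data), len(records))
```

Applied to the full response, the same split would give 25,151 rows in total and 25,150 place records.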

In [5]:
cols = ['NAME', 'Census_population', 'fips_state', 'fips_place']
city_2000 = pd.DataFrame(data[1:], columns=cols)

In [6]:
city_2000.head()

Unnamed: 0,NAME,Census_population,fips_state,fips_place
0,Altoona town,984,1,1660
1,Boaz city,7411,1,7912
2,Calera city,3158,1,11416
3,Childersburg city,4927,1,14464
4,Decatur city,53929,1,20104


The API returned 25,151 rows for the Year 2000 SF1 dataset: one header row plus 25,150 places. However, we do not need all of them. Our main source for the 2005-2018 dataset is ACS1, which covers at most 630 cities (in 2018), and our validation source is the NYU City Health Dataset, which covers 510 cities. 
Here, I combine the cities from the 2018 ACS1 list and the NYU dataset to define the largest set we need from these places: the 630 ACS1 cities (which include 495 of the NYU cities) plus the remaining 15 NYU cities, for a total of 645 cities.  
* Correction: The NYU City Health dataset includes a code for Honolulu County, which is not a place FIPS code, so I removed it from the NYU city list. Our total is therefore 644 cities.
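
The arithmetic behind the 644 figure can be checked quickly. This is a sketch using the counts reported in this notebook (630 cities loaded from the 2018 file, 510 NYU codes, 495 in both); `combined_city_count` is a hypothetical helper and the toy sets use made-up codes.

```python
def combined_city_count(n_acs, n_nyu, n_overlap, n_removed=0):
    """Size of the union of two city lists after dropping `n_removed` invalid codes."""
    return n_acs + (n_nyu - n_overlap) - n_removed


# 630 ACS1 cities, 510 NYU codes of which 495 overlap.
print(combined_city_count(630, 510, 495))
# Same, after removing the Honolulu County code from the NYU list.
print(combined_city_count(630, 510, 495, n_removed=1))

# The same idea with toy sets (made-up codes, for illustration only).
acs = {'A', 'B', 'C'}
nyu = {'B', 'C', 'D', 'X'}  # 'X' stands in for a code that is not a place
nyu.discard('X')
print(len(acs | nyu))
```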

In [7]:
## Establishing the 644-city list.

# Create NYU_500_Largest to flag whether a city is in the NYU dataset.

# Load the NYU FIPS codes.
# Read the FIPS columns as object (string) so leading zeros are kept.
coltype = {'fips_state':object,
          'fips_place':object,
           'fips_state_place':object}
nyu = pd.read_csv(r"/Users/jasonlim/Desktop/EFGS/Census-master/CitiesDataset/City_Health_data/nyu_fips_code.csv",\
                  dtype= coltype)
nyu_cities = list(nyu['fips_state_place'])

# Load the 2010-2018 data and keep only the 2018 rows.
place_pop = pd.read_csv("place_pop_2010_2018_asrh.csv", dtype=coltype)

place_pop_2018 = place_pop[place_pop['Year']==2018]
place_pop_2018_cities = list(place_pop_2018['fips_state_place'])

In [8]:
len(nyu_cities)

510

In [9]:
len(place_pop_2018_cities)

630

In [10]:
# Checking which cities from NYU are not in 2018 dataset.

set(nyu_cities).difference(place_pop_2018_cities)

{'0015003',
 '1349000',
 '3408920',
 '3413360',
 '3420350',
 '3426340',
 '3429430',
 '3439420',
 '3446680',
 '3457750',
 '3459640',
 '3465490',
 '5010675',
 '5414600',
 '5613900'}

0015003 is Honolulu, Hawaii, but it is a county FIPS code, not a place code, so I remove it from the NYU list; the place list already contains 1571550 (Urban Honolulu CDP). 
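
A quick way to spot such codes is to check the state prefix, since `00` is not assigned to any state. This is a heuristic sketch only (`looks_like_place_geoid` is a hypothetical name); it does not confirm that a code is a real place, only that it has the expected shape.

```python
def looks_like_place_geoid(code):
    """Heuristic: 7 digits whose first two are not the unassigned state code '00'."""
    return len(code) == 7 and code.isdigit() and code[:2] != '00'


# Three codes that appear in this notebook.
codes = ['0015003', '1571550', '1349000']
flagged = [c for c in codes if not looks_like_place_geoid(c)]
print(flagged)
```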



In [11]:
nyu_cities.remove('0015003')

In [12]:
result_cities= list(set(nyu_cities) | set(place_pop_2018_cities))

In [13]:
len(result_cities)

644

In [14]:
# Creating Fips_state_place key for joining dataframe

city_2000['fips_state_place'] = city_2000['fips_state']+city_2000['fips_place']

In [15]:
city_2000.head()

Unnamed: 0,NAME,Census_population,fips_state,fips_place,fips_state_place
0,Altoona town,984,1,1660,101660
1,Boaz city,7411,1,7912,107912
2,Calera city,3158,1,11416,111416
3,Childersburg city,4927,1,14464,114464
4,Decatur city,53929,1,20104,120104
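
The join key only works if both parts keep their leading zeros (`'01'` and `'01660'`), as they appear in the raw API response. A minimal sketch of a defensive version, assuming the codes might arrive as integers from another source; `make_geoid` is a hypothetical helper.

```python
def make_geoid(state, place):
    """Concatenate state and place codes into a 7-character key, zero-padding each part."""
    return str(state).zfill(2) + str(place).zfill(5)


# Strings straight from the API and integers that lost their zeros give the same key.
print(make_geoid('01', '01660'))
print(make_geoid(1, 1660))
```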


In [16]:
city_2000_first = city_2000[city_2000['fips_state_place'].isin(result_cities)]

In [17]:
len(city_2000_first)

625

After filtering, 625 cities remain from the 2000 dataset. This is due to changes in FIPS codes after the 2000 Census.  <br>
For example, FIPS code '1349000' is in the NYU dataset but not in the 2018 census place list. 

So this difference is expected.<br>
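
To see exactly which codes have no match in 2000, a set difference is enough. The sketch below uses made-up codes; `missing_codes` is a hypothetical helper.

```python
def missing_codes(wanted, available):
    """Return the wanted codes that have no match in `available`, sorted."""
    return sorted(set(wanted) - set(available))


# Toy codes for illustration only; they are not real FIPS codes.
wanted = ['0100001', '0100002', '0100003']
available_2000 = ['0100001', '0100003', '0100009']
print(missing_codes(wanted, available_2000))
```

Applied to `result_cities` and `city_2000['fips_state_place']`, this would list the 644 - 625 = 19 codes that are absent from the 2000 data.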

In [18]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the filtered slice.
city_2000_first = city_2000_first.assign(
    NYU_500_Largest=city_2000_first['fips_state_place'].isin(nyu_cities).astype(int))


In [19]:
city_2000_first.head()

Unnamed: 0,NAME,Census_population,fips_state,fips_place,fips_state_place,NYU_500_Largest
5,Dothan city,57737,1,21184,121184,0
32,Auburn city,42987,1,3076,103076,0
42,Birmingham city,242820,1,7000,107000,1
69,Hoover city,62742,1,35896,135896,1
179,Huntsville city,158216,1,37000,137000,1


## Before building each category's dataset

Each category may come in a different format, and some categories require calculations (e.g. age grouping).<br>
Since this is a single-year dataset, I could fetch every variable we need in one API call. However, that would not tell me each variable's format or how many missing values it has, and the calculations would be harder to manage with so many columns. So I will divide the work into 3 parts: <br>

1. Check the variables for each category and pick the ones that suit our dataset. Also check whether each category requires calculation. For example, the Sex category needs none, but the 65-years-and-over age group is spread across several variables that I will have to combine.
2. For variables that need no calculation, pull them together in a single API call.
3. For variables that need calculation, pull them separately and compute the fields we want. Then combine the result with the Year 2000 dataset.

## Sex



For Sex, 2000 SF1 has two main variables. <Br>

P012002-->	SEX BY AGE:Total Male <br>
P012026-->	SEX BY AGE:Total Female <BR>
    
    
Also, I want to check the format of these two variables. 


In [20]:
year='2000'
dsource='sf1'
cols = 'NAME,P012002,P012026'
base_url = f'https://api.census.gov/data/{year}/{dsource}?get={cols}'
data_url= f'{base_url}&for=place:*&in=state:*&key={api_key}'
response = requests.get(data_url)
data=response.json()
# Count the rows returned by the API (the first row is the header)
print(year, len(data))
    




2000 25151


In [21]:
# check the format of these two variables.

data[:5]

[['NAME', 'P012002', 'P012026', 'state', 'place'],
 ['Altoona town', '433', '551', '01', '01660'],
 ['Boaz city', '3379', '4032', '01', '07912'],
 ['Calera city', '1522', '1636', '01', '11416'],
 ['Childersburg city', '2251', '2676', '01', '14464']]

**The values are integer counts, returned as strings.**
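
As a sanity check, the male and female counts should add up to the total population (P001001) fetched earlier. A small sketch using the four rows printed in this notebook; `sexes_sum_to_total` is a hypothetical helper.

```python
# (name, total P001001, male P012002, female P012026), copied from the outputs above.
rows = [
    ('Altoona town', '984', '433', '551'),
    ('Boaz city', '7411', '3379', '4032'),
    ('Calera city', '3158', '1522', '1636'),
    ('Childersburg city', '4927', '2251', '2676'),
]


def sexes_sum_to_total(total, male, female):
    """The API returns counts as strings, so cast to int before comparing."""
    return int(male) + int(female) == int(total)


checks = {name: sexes_sum_to_total(t, m, f) for name, t, m, f in rows}
print(checks)
```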

## Race
For Race, 2000 SF1 has five main variables. <Br>
    
    
P007002 -->RACE:White alone <br>
P007003 --> RACE:Bl/AfAm alone <br>
P007005 --> RACE:Asian alone <Br>
P007004 -->RACE:AmInd/AK alone <Br>
P007006 --> RACE:HI alone <br>

In [22]:
year='2000'
dsource='sf1'
cols = 'NAME,P007002,P007003,P007005,P007004,P007006'
base_url = f'https://api.census.gov/data/{year}/{dsource}?get={cols}'
data_url= f'{base_url}&for=place:*&in=state:*&key={api_key}'
response = requests.get(data_url)
data=response.json()
# Count the rows returned by the API (the first row is the header)
print(year, len(data))
    




2000 25151


In [23]:
data[:5]

[['NAME',
  'P007002',
  'P007003',
  'P007005',
  'P007004',
  'P007006',
  'state',
  'place'],
 ['Altoona town', '939', '25', '0', '1', '0', '01', '01660'],
 ['Boaz city', '6929', '97', '33', '35', '3', '01', '07912'],
 ['Calera city', '2445', '629', '17', '6', '2', '01', '11416'],
 ['Childersburg city', '3393', '1465', '3', '16', '1', '01', '14464']]

**The values are integer counts, returned as strings.**
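
Unlike the sex variables, these five race-alone counts do not have to add up to the total population. The sketch below computes the remainder for two of the rows shown above; my assumption is that the remainder corresponds to categories not pulled here (such as some other race, or two or more races). `race_residual` is a hypothetical helper.

```python
# (name, total P001001, [white, black, asian, AIAN, NHOPI]), copied from the outputs above.
rows = [
    ('Altoona town', 984, [939, 25, 0, 1, 0]),
    ('Boaz city', 7411, [6929, 97, 33, 35, 3]),
]


def race_residual(total, race_counts):
    """Population not covered by the race-alone categories that were requested."""
    return total - sum(race_counts)


residuals = {name: race_residual(total, counts) for name, total, counts in rows}
print(residuals)
```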

## Ethnicity

For ethnicity, 2000 SF1 has two main variables. <Br>
    
    

P004003 --> HISPANIC:Total notHispanic or Latino <br>
P004002-->HISPANIC:Total Hispanic or Latino <br>


In [24]:
year='2000'
dsource='sf1'
cols = 'NAME,P004003,P004002'
base_url = f'https://api.census.gov/data/{year}/{dsource}?get={cols}'
data_url= f'{base_url}&for=place:*&in=state:*&key={api_key}'
response = requests.get(data_url)
data=response.json()
# Count the rows returned by the API (the first row is the header)
print(year, len(data))
    




2000 25151


In [25]:
data[:5]

[['NAME', 'P004003', 'P004002', 'state', 'place'],
 ['Altoona town', '959', '25', '01', '01660'],
 ['Boaz city', '7042', '369', '01', '07912'],
 ['Calera city', '3098', '60', '01', '11416'],
 ['Childersburg city', '4897', '30', '01', '14464']]

**The values are integer counts, returned as strings.**

## Single API Query for Sex, Race, Ethnicity

Having checked the sex, race, and ethnicity categories, I can now fetch all of them with a single API query. 


In [26]:
year='2000'
dsource='sf1'
cols = 'NAME,P001001,P012002,P012026,P007002,P007003,P007005,P007004,P007006,P004003,P004002'
base_url = f'https://api.census.gov/data/{year}/{dsource}?get={cols}'
data_url= f'{base_url}&for=place:*&in=state:*&key={api_key}'
response = requests.get(data_url)
data=response.json()
# Count the rows returned by the API (the first row is the header)
print(year, len(data))
    




2000 25151


In [27]:
cols_sre= ['NAME', 'census_population', 'pop_male', 'pop_female', \
           'pop_white', 'pop_black', 'pop_asian','pop_AIAN','pop_NHOPI', \
           'pop_nonhispanic','pop_hispanic','fips_state', 'fips_place']

# Set up each year's dataset



cities_2000 = pd.DataFrame(data[1:], columns=cols_sre)
cities_2000["Year"]=2000

In [28]:
# Creating Fips_state_place key for joining dataframe

cities_2000['fips_state_place'] = cities_2000['fips_state']+cities_2000['fips_place']

In [29]:
# filter cities with previous city list for NYU+2018 ACS1 city list, which is result_cities

cities_2000_df = cities_2000[cities_2000['fips_state_place'].isin(result_cities)]

In [30]:
len(cities_2000_df)

625

In [31]:
# Creating NYU column

# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the filtered slice.
cities_2000_df = cities_2000_df.assign(
    NYU_500_Largest=cities_2000_df['fips_state_place'].isin(nyu_cities).astype(int))


In [32]:
cities_2000_df.head()

Unnamed: 0,NAME,census_population,pop_male,pop_female,pop_white,pop_black,pop_asian,pop_AIAN,pop_NHOPI,pop_nonhispanic,pop_hispanic,fips_state,fips_place,Year,fips_state_place,NYU_500_Largest
5,Dothan city,57737,27093,30644,38873,17385,491,160,11,56973,764,1,21184,2000,121184,0
32,Auburn city,42987,21431,21556,33553,7217,1422,82,17,42321,666,1,3076,2000,103076,0
42,Birmingham city,242820,112046,130774,58457,178372,1942,422,87,239056,3764,1,7000,2000,107000,1
69,Hoover city,62742,30577,32165,54997,4248,1812,99,21,60362,2380,1,35896,2000,135896,1
179,Huntsville city,158216,76174,82042,101998,47792,3519,857,88,154991,3225,1,37000,2000,137000,1


## Age Group

For 18+, one variable should give the right value.

P005001 --> RACE 18+:Total! <br>

For 65+, two variables looked like they would give the right values. <br>

PCT017027--> GrpQrt/Sex/Age/Type:Male:65 years and over: <br>
PCT017064--> GrpQrt/Sex/Age/Type:Female:65 years and over! <br>

After the first query, I found that these two variables do not give correct values for 65 years and over. 
For example, for Auburn city they returned 38 (male) and 188 (female). When I ran a separate query for the male 65+ variables (SEX BY AGE:Male:65'&'66, SEX BY AGE:Male:67 to 69, SEX BY AGE:Male:70 to 74, SEX BY AGE:Male:75 to 79, SEX BY AGE:Male:80 to 84, SEX BY AGE:Male:85 yrs'&'over), it returned the following: 
'151',  '217',  '263',  '212',  '128',  '92'<br>

These add up to 1,063, far more than the 38 reported by PCT017027. This is consistent with the PCT017 variables covering only the group quarters population. <br>
Therefore, I decided not to use the grouped variables and will combine the following variables instead:
  
Variable Name | Variable Label
-- | --
P012020 | SEX BY AGE: Male: 65 and 66
P012021 | SEX BY AGE: Male: 67 to 69
P012022 | SEX BY AGE: Male: 70 to 74
P012023 | SEX BY AGE: Male: 75 to 79
P012024 | SEX BY AGE: Male: 80 to 84
P012025 | SEX BY AGE: Male: 85 years and over
P012044 | SEX BY AGE: Female: 65 and 66
P012045 | SEX BY AGE: Female: 67 to 69
P012046 | SEX BY AGE: Female: 70 to 74
P012047 | SEX BY AGE: Female: 75 to 79
P012048 | SEX BY AGE: Female: 80 to 84
P012049 | SEX BY AGE: Female: 85 years and over
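
As a worked check for Auburn city, the sketch below sums the twelve 65+ counts (taken from the Auburn row of the age query output later in this notebook) and converts the sum to a share of the total population, using the same rounding as the calculation cells. The variable names are my own.

```python
# Auburn city counts for the six male and six female 65+ age bands.
male_65_plus = [151, 217, 263, 212, 128, 92]
female_65_plus = [159, 237, 366, 349, 281, 311]
census_pop = 42987  # P001001 for Auburn city

male_total = sum(male_65_plus)
female_total = sum(female_65_plus)

# Share of the total population, rounded to 4 decimals and expressed as a percentage.
share = 100 * round((male_total + female_total) / census_pop, 4)
print(male_total, female_total, round(share, 2))
```

The male total of 1,063 matches the figure quoted above, and the resulting share agrees with the 6.43 shown for Auburn in the final table.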

  
For 15 to 44 years, I may need to combine the following variables:


Variable Name | Variable Label
-- | --
P012006 | SEX BY AGE: Male: 15 to 17
P012007 | SEX BY AGE: Male: 18 and 19
P012008 | SEX BY AGE: Male: 20
P012009 | SEX BY AGE: Male: 21
P012010 | SEX BY AGE: Male: 22 to 24
P012011 | SEX BY AGE: Male: 25 to 29
P012012 | SEX BY AGE: Male: 30 to 34
P012013 | SEX BY AGE: Male: 35 to 39
P012014 | SEX BY AGE: Male: 40 to 44
P012030 | SEX BY AGE: Female: 15 to 17
P012031 | SEX BY AGE: Female: 18 and 19
P012032 | SEX BY AGE: Female: 20
P012033 | SEX BY AGE: Female: 21
P012034 | SEX BY AGE: Female: 22 to 24
P012035 | SEX BY AGE: Female: 25 to 29
P012036 | SEX BY AGE: Female: 30 to 34
P012037 | SEX BY AGE: Female: 35 to 39
P012038 | SEX BY AGE: Female: 40 to 44





This chart shows how I calculated each age group. Each value is divided by the total population (P001001) and expressed as a percentage.

Item | 15 to 44 years | 18 years and over | 65 years and over
-- | -- | -- | --
Formula | Sum of all variables between 15-44 years | 18 +: total | Sum of all SEX BY AGE variables for 65 years and over
Variable | P012006, P012007, P012008, P012009, P012010, P012011, P012012, P012013, P012014, P012030, P012031, P012032, P012033, P012034, P012035, P012036, P012037, P012038 | P005001 | P012020, P012021, P012022, P012023, P012024, P012025, P012044, P012045, P012046, P012047, P012048, P012049
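
The formulas in the chart amount to a row-wise sum over a list of columns, divided by the population. The sketch below shows that pattern for the 15 to 44 group on two cities, with counts copied from the age query output in this notebook; `pct_of_population` is a hypothetical helper, not the code used in the cells that follow.

```python
import pandas as pd

age_cols = ['Male:15 to 17', 'Male:18 and 19', 'Male:20', 'Male:21', 'Male:22 to 24',
            'Male:25 to 29', 'Male:30 to 34', 'Male:35 to 39', 'Male:40 to 44',
            'Female:15 to 17', 'Female:18 and 19', 'Female:20', 'Female:21',
            'Female:22 to 24', 'Female:25 to 29', 'Female:30 to 34',
            'Female:35 to 39', 'Female:40 to 44']

rows = [
    ['Dothan city', 57737, 1271, 727, 365, 306, 959, 1757, 1724, 2077, 2151,
     1250, 699, 382, 308, 1109, 2014, 1969, 2290, 2339],
    ['Auburn city', 42987, 563, 2532, 1859, 1809, 3490, 1977, 1081, 962, 968,
     559, 3007, 1945, 1802, 2714, 1422, 969, 1027, 1015],
]
df = pd.DataFrame(rows, columns=['NAME', 'census_pop'] + age_cols)


def pct_of_population(frame, cols, total_col='census_pop'):
    """Row-wise sum of `cols` as a percentage of `total_col` (ratio rounded to 4 decimals)."""
    return 100 * (frame[cols].sum(axis=1) / frame[total_col]).round(4)


df['15 to 44 years'] = pct_of_population(df, age_cols)
print(df[['NAME', '15 to 44 years']])
```

Keeping the column names in a list makes it easy to reuse the same function for the 65 years and over group.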

In [33]:
year='2000'
dsource='sf1'
# The API does not accept spaces between column names, so the list is written as one long string.
cols = 'NAME,P001001,P012006,P012007,P012008,P012009,P012010,P012011,P012012,P012013,P012014,P012030,P012031,P012032,P012033,P012034,P012035,P012036,P012037,P012038,P005001,P012020,P012021,P012022,P012023,P012024,P012025,P012044,P012045,P012046,P012047,P012048,P012049'
base_url = f'https://api.census.gov/data/{year}/{dsource}?get={cols}'
data_url= f'{base_url}&for=place:*&in=state:*&key={api_key}'
response = requests.get(data_url)
data=response.json()
# Count the rows returned by the API (the first row is the header)
print(year, len(data))
    


cols_age= ['NAME', 'census_pop','Male:15 to 17','Male:18 and 19', 'Male:20', 'Male:21','Male:22 to 24','Male:25 to 29',\
           'Male:30 to 34','Male:35 to 39','Male:40 to 44',\
           'Female:15 to 17','Female:18 and 19','Female:20','Female:21','Female:22 to 24',\
           'Female:25 to 29','Female:30 to 34','Female:35 to 39','Female:40 to 44',\
           '18 +: total', 'Male:65 and 66','Male:67 to 69','Male:70 to 74','Male:75 to 79','Male:80 to 84','Male:85 yrs and over',\
           'Female:65 and 66','Female:67 to 69','Female:70 to 74','Female:75 to 79','Female:80 to 84',\
           'Female:85 yrs and over','fips_state', 'fips_place']

# Set up each year's dataset



cities_2000_age = pd.DataFrame(data[1:], columns=cols_age)
cities_2000_age["Year"]=2000   

# Creating Fips_state_place key for joining dataframe

cities_2000_age['fips_state_place'] = cities_2000_age['fips_state']+cities_2000_age['fips_place']

# filter cities with previous city list for NYU+2018 ACS1 city list, which is result_cities

cities_2000_age2 = cities_2000_age[cities_2000_age['fips_state_place'].isin(result_cities)]

2000 25151


In [34]:
cities_2000_age2.head()

Unnamed: 0,NAME,census_pop,Male:15 to 17,Male:18 and 19,Male:20,Male:21,Male:22 to 24,Male:25 to 29,Male:30 to 34,Male:35 to 39,Male:40 to 44,Female:15 to 17,Female:18 and 19,Female:20,Female:21,Female:22 to 24,Female:25 to 29,Female:30 to 34,Female:35 to 39,Female:40 to 44,18 +: total,Male:65 and 66,Male:67 to 69,Male:70 to 74,Male:75 to 79,Male:80 to 84,Male:85 yrs and over,Female:65 and 66,Female:67 to 69,Female:70 to 74,Female:75 to 79,Female:80 to 84,Female:85 yrs and over,fips_state,fips_place,Year,fips_state_place
5,Dothan city,57737,1271,727,365,306,959,1757,1724,2077,2151,1250,699,382,308,1109,2014,1969,2290,2339,43061,429,570,898,619,392,261,526,746,1241,1083,769,882,1,21184,2000,121184
32,Auburn city,42987,563,2532,1859,1809,3490,1977,1081,962,968,559,3007,1945,1802,2714,1422,969,1027,1015,36383,151,217,263,212,128,92,159,237,366,349,281,311,1,3076,2000,103076
42,Birmingham city,242820,4938,3469,1848,1903,5363,9265,8071,8431,8555,4989,3929,2041,2054,6302,9981,8879,9517,10125,182013,1373,1939,3160,2593,1648,1061,1954,2875,5087,4361,3059,3572,1,7000,2000,107000
69,Hoover city,62742,1302,653,303,309,1245,2517,2287,2595,2656,1234,548,291,289,1349,2452,2328,2773,2868,47200,340,493,755,592,306,215,412,616,1029,870,582,606,1,35896,2000,135896
179,Huntsville city,158216,3147,2724,1283,1226,3374,5330,5228,6096,6181,3048,2450,1273,1243,3382,5397,5378,6435,6383,121601,1283,1799,2363,1718,985,524,1584,2083,3107,2558,1713,1445,1,37000,2000,137000


In [35]:
cities_2000_age2.columns

Index(['NAME', 'census_pop', 'Male:15 to 17', 'Male:18 and 19', 'Male:20',
       'Male:21', 'Male:22 to 24', 'Male:25 to 29', 'Male:30 to 34',
       'Male:35 to 39', 'Male:40 to 44', 'Female:15 to 17', 'Female:18 and 19',
       'Female:20', 'Female:21', 'Female:22 to 24', 'Female:25 to 29',
       'Female:30 to 34', 'Female:35 to 39', 'Female:40 to 44', '18 +: total',
       'Male:65 and 66', 'Male:67 to 69', 'Male:70 to 74', 'Male:75 to 79',
       'Male:80 to 84', 'Male:85 yrs and over', 'Female:65 and 66',
       'Female:67 to 69', 'Female:70 to 74', 'Female:75 to 79',
       'Female:80 to 84', 'Female:85 yrs and over', 'fips_state', 'fips_place',
       'Year', 'fips_state_place'],
      dtype='object')

In [36]:
# Before Calculation, change column type from string to int.
age_int_col = ['census_pop', 'Male:15 to 17', 'Male:18 and 19', 'Male:20',
       'Male:21', 'Male:22 to 24', 'Male:25 to 29', 'Male:30 to 34',
       'Male:35 to 39', 'Male:40 to 44', 'Female:15 to 17', 'Female:18 and 19',
       'Female:20', 'Female:21', 'Female:22 to 24', 'Female:25 to 29',
       'Female:30 to 34', 'Female:35 to 39', 'Female:40 to 44', '18 +: total',
       'Male:65 and 66', 'Male:67 to 69', 'Male:70 to 74', 'Male:75 to 79',
       'Male:80 to 84', 'Male:85 yrs and over', 'Female:65 and 66',
       'Female:67 to 69', 'Female:70 to 74', 'Female:75 to 79',
       'Female:80 to 84', 'Female:85 yrs and over']


# astype() with a column mapping returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
cities_2000_age2 = cities_2000_age2.astype({col: int for col in age_int_col})


In [37]:
cities_2000_age2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 625 entries, 5 to 24972
Data columns (total 37 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   NAME                    625 non-null    object
 1   census_pop              625 non-null    int64 
 2   Male:15 to 17           625 non-null    int64 
 3   Male:18 and 19          625 non-null    int64 
 4   Male:20                 625 non-null    int64 
 5   Male:21                 625 non-null    int64 
 6   Male:22 to 24           625 non-null    int64 
 7   Male:25 to 29           625 non-null    int64 
 8   Male:30 to 34           625 non-null    int64 
 9   Male:35 to 39           625 non-null    int64 
 10  Male:40 to 44           625 non-null    int64 
 11  Female:15 to 17         625 non-null    int64 
 12  Female:18 and 19        625 non-null    int64 
 13  Female:20               625 non-null    int64 
 14  Female:21               625 non-null    int64 
 15  Fema

In [38]:
# Calculation for 15-44 years.

cols_15_to_44 = ['Male:15 to 17', 'Male:18 and 19', 'Male:20', 'Male:21', 'Male:22 to 24',
                 'Male:25 to 29', 'Male:30 to 34', 'Male:35 to 39', 'Male:40 to 44',
                 'Female:15 to 17', 'Female:18 and 19', 'Female:20', 'Female:21', 'Female:22 to 24',
                 'Female:25 to 29', 'Female:30 to 34', 'Female:35 to 39', 'Female:40 to 44']

# Share of the total population as a percentage.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
cities_2000_age2 = cities_2000_age2.assign(**{
    '15 to 44 years': 100 * (cities_2000_age2[cols_15_to_44].sum(axis=1)
                             / cities_2000_age2['census_pop']).round(4)})


In [39]:
# Calculation for 18 years and over

# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
cities_2000_age2 = cities_2000_age2.assign(**{
    '18 years and over': 100 * (cities_2000_age2['18 +: total']
                                / cities_2000_age2['census_pop']).round(4)})


In [40]:
# calculation for 65 years and over
cols_65_plus = ['Male:65 and 66', 'Male:67 to 69', 'Male:70 to 74', 'Male:75 to 79',
                'Male:80 to 84', 'Male:85 yrs and over',
                'Female:65 and 66', 'Female:67 to 69', 'Female:70 to 74', 'Female:75 to 79',
                'Female:80 to 84', 'Female:85 yrs and over']

# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
cities_2000_age2 = cities_2000_age2.assign(**{
    '65 years and over': 100 * (cities_2000_age2[cols_65_plus].sum(axis=1)
                                / cities_2000_age2['census_pop']).round(4)})


In [41]:
cities_2000_age2.head()

Unnamed: 0,NAME,census_pop,Male:15 to 17,Male:18 and 19,Male:20,Male:21,Male:22 to 24,Male:25 to 29,Male:30 to 34,Male:35 to 39,Male:40 to 44,Female:15 to 17,Female:18 and 19,Female:20,Female:21,Female:22 to 24,Female:25 to 29,Female:30 to 34,Female:35 to 39,Female:40 to 44,18 +: total,Male:65 and 66,Male:67 to 69,Male:70 to 74,Male:75 to 79,Male:80 to 84,Male:85 yrs and over,Female:65 and 66,Female:67 to 69,Female:70 to 74,Female:75 to 79,Female:80 to 84,Female:85 yrs and over,fips_state,fips_place,Year,fips_state_place,15 to 44 years,18 years and over,65 years and over
5,Dothan city,57737,1271,727,365,306,959,1757,1724,2077,2151,1250,699,382,308,1109,2014,1969,2290,2339,43061,429,570,898,619,392,261,526,746,1241,1083,769,882,1,21184,2000,121184,41.04,74.58,14.58
32,Auburn city,42987,563,2532,1859,1809,3490,1977,1081,962,968,559,3007,1945,1802,2714,1422,969,1027,1015,36383,151,217,263,212,128,92,159,237,366,349,281,311,1,3076,2000,103076,69.09,84.64,6.43
42,Birmingham city,242820,4938,3469,1848,1903,5363,9265,8071,8431,8555,4989,3929,2041,2054,6302,9981,8879,9517,10125,182013,1373,1939,3160,2593,1648,1061,1954,2875,5087,4361,3059,3572,1,7000,2000,107000,45.16,74.96,13.46
69,Hoover city,62742,1302,653,303,309,1245,2517,2287,2595,2656,1234,548,291,289,1349,2452,2328,2773,2868,47200,340,493,755,592,306,215,412,616,1029,870,582,606,1,35896,2000,135896,44.63,75.23,10.86
179,Huntsville city,158216,3147,2724,1283,1226,3374,5330,5228,6096,6181,3048,2450,1273,1243,3382,5397,5378,6435,6383,121601,1283,1799,2363,1718,985,524,1584,2083,3107,2558,1713,1445,1,37000,2000,137000,43.98,76.86,13.38


In [42]:
age_2000_final = cities_2000_age2[['NAME','15 to 44 years','18 years and over', '65 years and over','fips_state',\
       'fips_place', 'Year', 'fips_state_place' ]]

In [43]:
age_2000_final.head()

Unnamed: 0,NAME,15 to 44 years,18 years and over,65 years and over,fips_state,fips_place,Year,fips_state_place
5,Dothan city,41.04,74.58,14.58,1,21184,2000,121184
32,Auburn city,69.09,84.64,6.43,1,3076,2000,103076
42,Birmingham city,45.16,74.96,13.46,1,7000,2000,107000
69,Hoover city,44.63,75.23,10.86,1,35896,2000,135896
179,Huntsville city,43.98,76.86,13.38,1,37000,2000,137000


In [44]:
# joining with previous dataframe for sex, race, ethnicity.

year2000_all_categories= pd.merge(cities_2000_df, age_2000_final, how='left',on=['fips_state_place','Year','NAME','fips_state', 'fips_place'])

In [45]:
year2000_all_categories.head()

Unnamed: 0,NAME,census_population,pop_male,pop_female,pop_white,pop_black,pop_asian,pop_AIAN,pop_NHOPI,pop_nonhispanic,pop_hispanic,fips_state,fips_place,Year,fips_state_place,NYU_500_Largest,15 to 44 years,18 years and over,65 years and over
0,Dothan city,57737,27093,30644,38873,17385,491,160,11,56973,764,1,21184,2000,121184,0,41.04,74.58,14.58
1,Auburn city,42987,21431,21556,33553,7217,1422,82,17,42321,666,1,3076,2000,103076,0,69.09,84.64,6.43
2,Birmingham city,242820,112046,130774,58457,178372,1942,422,87,239056,3764,1,7000,2000,107000,1,45.16,74.96,13.46
3,Hoover city,62742,30577,32165,54997,4248,1812,99,21,60362,2380,1,35896,2000,135896,1,44.63,75.23,10.86
4,Huntsville city,158216,76174,82042,101998,47792,3519,857,88,154991,3225,1,37000,2000,137000,1,43.98,76.86,13.38


In [46]:
# Count NaN cells across the whole dataframe
print(year2000_all_categories.isnull().sum().sum())

0
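
Beyond counting NaN cells, the join itself can be checked. The sketch below uses a toy pair of frames (two cities, values copied from the tables above) and two `pd.merge` options: `validate` raises an error if a key is duplicated on either side, and `indicator` adds a `_merge` column showing which side each row came from. It illustrates the check only and is not the merge used above.

```python
import pandas as pd

left = pd.DataFrame({'fips_state_place': ['0121184', '0103076'],
                     'census_population': [57737, 42987]})
right = pd.DataFrame({'fips_state_place': ['0121184', '0103076'],
                      '65 years and over': [14.58, 6.43]})

# One row per key is expected on both sides; any left-only row would show up in _merge.
merged = pd.merge(left, right, how='left', on='fips_state_place',
                  validate='one_to_one', indicator=True)
unmatched = int((merged['_merge'] != 'both').sum())
print(unmatched)
```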


# Combine with state name 

The 2000 Census SF1 response does not include the st_name column that we use in the final dataframe, so I will build it from the Census state geocodes file.
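
The lookup amounts to a dictionary from two-digit state FIPS codes to state names. A minimal sketch with three entries taken from the dictionary printed below; `state_name` is a hypothetical helper, and returning `None` for unknown codes is my own choice here.

```python
# A few entries from the state lookup (two-digit FIPS code -> state name).
state_names = {'01': 'Alabama', '02': 'Alaska', '04': 'Arizona'}


def state_name(fips_state, lookup=state_names):
    """Look up a state name by FIPS code; returns None for unknown codes."""
    # Zero-pad so that an integer 1 and the string '01' resolve to the same key.
    return lookup.get(str(fips_state).zfill(2))


print(state_name('01'), state_name(4), state_name('99'))
```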




In [47]:
# Pull out All State FIPS Code from Geocodes csv
geocodes = pd.read_csv(r"/Users/jasonlim/Desktop/EFGS/Census-master/Geocodes/state-geocodes-v2018.csv")


In [48]:
cols = geocodes.iloc[4,].values

In [49]:
cols

array(['Region', 'Division', 'State (FIPS)', 'Name'], dtype=object)

In [50]:
geocodes = geocodes[5:]

geocodes.columns = cols

In [51]:
geocodes.head()

Unnamed: 0,Region,Division,State (FIPS),Name
5,1,0,0,Northeast Region
6,1,1,0,New England Division
7,1,1,9,Connecticut
8,1,1,23,Maine
9,1,1,25,Massachusetts


In [52]:
geocodes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 5 to 68
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Region        64 non-null     object
 1   Division      64 non-null     object
 2   State (FIPS)  64 non-null     object
 3   Name          64 non-null     object
dtypes: object(4)
memory usage: 2.1+ KB


In [53]:
# Filter out region codes

geo_state = geocodes[geocodes['State (FIPS)'] !='00']

In [54]:
state_dic = dict(zip(geo_state['State (FIPS)'], geo_state['Name']))

state_dic1 = OrderedDict(sorted(state_dic.items())) 

state_dic1

OrderedDict([('01', 'Alabama'),
             ('02', 'Alaska'),
             ('04', 'Arizona'),
             ('05', 'Arkansas'),
             ('06', 'California'),
             ('08', 'Colorado'),
             ('09', 'Connecticut'),
             ('10', 'Delaware'),
             ('11', 'District of Columbia'),
             ('12', 'Florida'),
             ('13', 'Georgia'),
             ('15', 'Hawaii'),
             ('16', 'Idaho'),
             ('17', 'Illinois'),
             ('18', 'Indiana'),
             ('19', 'Iowa'),
             ('20', 'Kansas'),
             ('21', 'Kentucky'),
             ('22', 'Louisiana'),
             ('23', 'Maine'),
             ('24', 'Maryland'),
             ('25', 'Massachusetts'),
             ('26', 'Michigan'),
             ('27', 'Minnesota'),
             ('28', 'Mississippi'),
             ('29', 'Missouri'),
             ('30', 'Montana'),
             ('31', 'Nebraska'),
             ('32', 'Nevada'),
             ('33', 'New Hampshire'),
  

In [55]:
def set_value(key, lookup):
    """Return the value stored in `lookup` under `key` (here: state FIPS code -> state name)."""
    return lookup[key]

In [56]:
# Add a new column named 'st_name' 
year2000_all_categories['st_name'] = year2000_all_categories['fips_state'].apply(set_value, args =(state_dic1, ))

In [57]:
year2000_all_categories.head()

Unnamed: 0,NAME,census_population,pop_male,pop_female,pop_white,pop_black,pop_asian,pop_AIAN,pop_NHOPI,pop_nonhispanic,pop_hispanic,fips_state,fips_place,Year,fips_state_place,NYU_500_Largest,15 to 44 years,18 years and over,65 years and over,st_name
0,Dothan city,57737,27093,30644,38873,17385,491,160,11,56973,764,1,21184,2000,121184,0,41.04,74.58,14.58,Alabama
1,Auburn city,42987,21431,21556,33553,7217,1422,82,17,42321,666,1,3076,2000,103076,0,69.09,84.64,6.43,Alabama
2,Birmingham city,242820,112046,130774,58457,178372,1942,422,87,239056,3764,1,7000,2000,107000,1,45.16,74.96,13.46,Alabama
3,Hoover city,62742,30577,32165,54997,4248,1812,99,21,60362,2380,1,35896,2000,135896,1,44.63,75.23,10.86,Alabama
4,Huntsville city,158216,76174,82042,101998,47792,3519,857,88,154991,3225,1,37000,2000,137000,1,43.98,76.86,13.38,Alabama


In [58]:
# Rename NAME to city_name, the column name used in our dataset

year2000_all_categories.rename(columns={"NAME": 'city_name'}, inplace=True)

In [59]:
# Convert the count columns from string to int
count_cols = ['census_population', 'pop_male', 'pop_female', 'pop_white', 'pop_black',
              'pop_asian', 'pop_AIAN', 'pop_NHOPI', 'pop_nonhispanic', 'pop_hispanic']
year2000_all_categories[count_cols] = year2000_all_categories[count_cols].astype(int)



In [60]:
year2000_all_categories.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 625 entries, 0 to 624
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   city_name          625 non-null    object 
 1   census_population  625 non-null    int64  
 2   pop_male           625 non-null    int64  
 3   pop_female         625 non-null    int64  
 4   pop_white          625 non-null    int64  
 5   pop_black          625 non-null    int64  
 6   pop_asian          625 non-null    int64  
 7   pop_AIAN           625 non-null    int64  
 8   pop_NHOPI          625 non-null    int64  
 9   pop_nonhispanic    625 non-null    int64  
 10  pop_hispanic       625 non-null    int64  
 11  fips_state         625 non-null    object 
 12  fips_place         625 non-null    object 
 13  Year               625 non-null    int64  
 14  fips_state_place   625 non-null    object 
 15  NYU_500_Largest    625 non-null    int64  
 16  15 to 44 years     625 non

### No column has missing values. 

In [61]:

# Re-order the columns of the dataset
year2000_all_categories = year2000_all_categories[['fips_state','fips_place','fips_state_place','city_name','st_name',\
                                                   'NYU_500_Largest','Year','census_population',\
                                                   '15 to 44 years','18 years and over', '65 years and over',\
                                                   'pop_male', 'pop_female',\
                                                   'pop_white', 'pop_black', 'pop_asian','pop_AIAN', 'pop_NHOPI',\
                                                   'pop_nonhispanic', 'pop_hispanic']]

# Create CSV and Excel File
year2000_all_categories.to_csv("place_pop_2000_asrh.csv", index=False)
year2000_all_categories.to_excel("place_pop_2000_asrh.xlsx", index=False)

# Combine with 2005-2018 Dataset

In [62]:


# Load the 2005-2018 dataset
# Read the FIPS columns as object (string) so they are not parsed as integers
coltype = {'fips_state':object,
          'fips_place':object,
           'fips_state_place':object}
df_2005_2018 = pd.read_csv(r"/Users/jasonlim/Desktop/EFGS/Census-master/CitiesDataset/Cities_Dataset_Year/2018/All_year_combined/place_pop_2005_2018_asrh.csv",\
                  dtype= coltype)

In [63]:
df_2005_2018.head()

Unnamed: 0,fips_state,fips_place,fips_state_place,city_name,st_name,NYU_500_Largest,Year,census_population,15 to 44 years,18 years and over,65 years and over,pop_male,pop_female,pop_white,pop_black,pop_asian,pop_AIAN,pop_NHOPI,pop_nonhispanic,pop_hispanic
0,29,35000,2935000,Independence city,Missouri,1,2005,111842,39.2,77.45,14.87,57505,54337,99498,4218,689,129,54,-998,-998
1,29,38000,2938000,Kansas City city,Missouri,1,2005,440885,44.74,75.39,10.56,212392,228493,271210,132187,10878,1988,1054,404890,35995
2,29,41348,2941348,Lee's Summit city,Missouri,1,2005,86357,39.48,69.75,10.24,40180,46177,78574,4712,821,268,116,-998,-998
3,29,54074,2954074,O'Fallon city,Missouri,1,2005,67875,46.09,69.5,6.16,34618,33257,62989,3406,682,0,0,-998,-998
4,29,64550,2964550,St. Joseph city,Missouri,1,2005,68971,42.88,75.05,13.79,32857,36114,64150,3038,457,395,0,-998,-998


In [64]:
df_2005_2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7905 entries, 0 to 7904
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   fips_state         7905 non-null   object 
 1   fips_place         7905 non-null   object 
 2   fips_state_place   7905 non-null   object 
 3   city_name          7905 non-null   object 
 4   st_name            7905 non-null   object 
 5   NYU_500_Largest    7905 non-null   int64  
 6   Year               7905 non-null   int64  
 7   census_population  7905 non-null   int64  
 8   15 to 44 years     7905 non-null   float64
 9   18 years and over  7905 non-null   float64
 10  65 years and over  7905 non-null   float64
 11  pop_male           7905 non-null   int64  
 12  pop_female         7905 non-null   int64  
 13  pop_white          7905 non-null   int64  
 14  pop_black          7905 non-null   int64  
 15  pop_asian          7905 non-null   int64  
 16  pop_AIAN           7905 

In [65]:
year2000_all_categories.head()

Unnamed: 0,fips_state,fips_place,fips_state_place,city_name,st_name,NYU_500_Largest,Year,census_population,15 to 44 years,18 years and over,65 years and over,pop_male,pop_female,pop_white,pop_black,pop_asian,pop_AIAN,pop_NHOPI,pop_nonhispanic,pop_hispanic
0,1,21184,121184,Dothan city,Alabama,0,2000,57737,41.04,74.58,14.58,27093,30644,38873,17385,491,160,11,56973,764
1,1,3076,103076,Auburn city,Alabama,0,2000,42987,69.09,84.64,6.43,21431,21556,33553,7217,1422,82,17,42321,666
2,1,7000,107000,Birmingham city,Alabama,1,2000,242820,45.16,74.96,13.46,112046,130774,58457,178372,1942,422,87,239056,3764
3,1,35896,135896,Hoover city,Alabama,1,2000,62742,44.63,75.23,10.86,30577,32165,54997,4248,1812,99,21,60362,2380
4,1,37000,137000,Huntsville city,Alabama,1,2000,158216,43.98,76.86,13.38,76174,82042,101998,47792,3519,857,88,154991,3225


In [66]:
df_2000_2018 = pd.concat([year2000_all_categories,df_2005_2018])

In [67]:
df_2000_2018.head()

Unnamed: 0,fips_state,fips_place,fips_state_place,city_name,st_name,NYU_500_Largest,Year,census_population,15 to 44 years,18 years and over,65 years and over,pop_male,pop_female,pop_white,pop_black,pop_asian,pop_AIAN,pop_NHOPI,pop_nonhispanic,pop_hispanic
0,1,21184,121184,Dothan city,Alabama,0,2000,57737,41.04,74.58,14.58,27093,30644,38873,17385,491,160,11,56973,764
1,1,3076,103076,Auburn city,Alabama,0,2000,42987,69.09,84.64,6.43,21431,21556,33553,7217,1422,82,17,42321,666
2,1,7000,107000,Birmingham city,Alabama,1,2000,242820,45.16,74.96,13.46,112046,130774,58457,178372,1942,422,87,239056,3764
3,1,35896,135896,Hoover city,Alabama,1,2000,62742,44.63,75.23,10.86,30577,32165,54997,4248,1812,99,21,60362,2380
4,1,37000,137000,Huntsville city,Alabama,1,2000,158216,43.98,76.86,13.38,76174,82042,101998,47792,3519,857,88,154991,3225


In [68]:
df_2000_2018.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8530 entries, 0 to 7904
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   fips_state         8530 non-null   object 
 1   fips_place         8530 non-null   object 
 2   fips_state_place   8530 non-null   object 
 3   city_name          8530 non-null   object 
 4   st_name            8530 non-null   object 
 5   NYU_500_Largest    8530 non-null   int64  
 6   Year               8530 non-null   int64  
 7   census_population  8530 non-null   int64  
 8   15 to 44 years     8530 non-null   float64
 9   18 years and over  8530 non-null   float64
 10  65 years and over  8530 non-null   float64
 11  pop_male           8530 non-null   int64  
 12  pop_female         8530 non-null   int64  
 13  pop_white          8530 non-null   int64  
 14  pop_black          8530 non-null   int64  
 15  pop_asian          8530 non-null   int64  
 16  pop_AIAN           8530 
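The index above runs from 0 to 7904 even though the frame has 8,530 entries, because `pd.concat` keeps the row labels of each input by default. As a minimal sketch on made-up toy frames (not the real data), passing `ignore_index=True` renumbers the combined rows from 0:

```python
import pandas as pd

# Toy stand-ins for the 2000 and 2005-2018 frames (values are illustrative only)
df_a = pd.DataFrame({"Year": [2000, 2000], "census_population": [10, 20]})
df_b = pd.DataFrame({"Year": [2005, 2005, 2005], "census_population": [11, 21, 31]})

# Default behaviour: row labels from both inputs are kept, so 0 and 1 repeat
kept = pd.concat([df_a, df_b])

# ignore_index=True discards the old labels and renumbers the rows
renumbered = pd.concat([df_a, df_b], ignore_index=True)

print(kept.index.tolist())        # [0, 1, 0, 1, 2]
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]
```

Whether the repeated labels matter depends on later steps; they do not affect the CSV export when `index=False` is used.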

In [69]:
# Create New files
# df_2000_2018.to_csv("place_pop_2000_2018_asrh.csv", index=False)
# df_2000_2018.to_excel("place_pop_2000_2018_asrh.xlsx", index=False)

# Year 1990
For Year 1990, the 1990 Decennial Census of Population and Housing Summary File 1 (SF1) dataset provides the variables that we can use. 

https://api.census.gov/data/1990/sf1/variables.html

First, check how many cities (place FIPS codes) there are.

For Census population, I can use P0010001 --> Total Persons <br>to get the total population and to count how many cities are in the dataset. Also, the city name variable was NAME in 2000, but in 1990 the city name variable is ANPSADPI. 
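Because the variable names differ between the two censuses, one option is to keep them in a small lookup keyed by year. The sketch below is only an illustration: `SF1_VARS` and `get_clause` are hypothetical names, and the variable codes are the ones quoted in this notebook.

```python
# Name and total-population variables for each SF1 year, as quoted in this notebook
SF1_VARS = {
    "2000": {"name": "NAME", "total_pop": "P001001"},
    "1990": {"name": "ANPSADPI", "total_pop": "P0010001"},
}


def get_clause(year):
    """Return the comma-separated variable list for the API's get= parameter."""
    variables = SF1_VARS[year]
    return f"{variables['name']},{variables['total_pop']}"


print(get_clause("1990"))  # ANPSADPI,P0010001
print(get_clause("2000"))  # NAME,P001001
```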



In [70]:
year='1990'
dsource='sf1'
cols = 'ANPSADPI,P0010001'
base_url = f'https://api.census.gov/data/{year}/{dsource}?get={cols}'
data_url= f'{base_url}&for=place:*&in=state:*&key={api_key}'
response = requests.get(data_url)
data=response.json()
# Check how many rows the API returned (one header row plus one row per place)
print(year, len(data))
    




1990 23436


In [71]:
data[:5]

[['ANPSADPI', 'P0010001', 'state', 'place'],
 ['Abbeville city', '3173', '01', '00124'],
 ['Adamsville city', '4161', '01', '00460'],
 ['Addison town', '626', '01', '00484'],
 ['Akron town', '468', '01', '00676']]

After checking, the API returns 23,436 rows for the Year 1990 SF1 dataset (one header row plus one row per place). As before, we only need the 644 cities that we mentioned in the Year 2000 section, so I will filter to those cities first. 
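The filtering step can be sketched in plain Python on the sample rows shown above. The `wanted` set is a made-up stand-in for `result_cities`, and the sketch assumes the target list stores its keys in the same string form. The state and place codes are concatenated as strings, so the leading zeros of both codes stay in the key.

```python
# Header plus sample rows copied from the API output above
data = [
    ["ANPSADPI", "P0010001", "state", "place"],
    ["Abbeville city", "3173", "01", "00124"],
    ["Adamsville city", "4161", "01", "00460"],
    ["Addison town", "626", "01", "00484"],
    ["Akron town", "468", "01", "00676"],
]

# Hypothetical target list standing in for result_cities
wanted = {"0100460", "0100676"}

header, rows = data[0], data[1:]
state_i, place_i = header.index("state"), header.index("place")

# String concatenation keeps the leading zeros of both codes
filtered = [r for r in rows if r[state_i] + r[place_i] in wanted]

for r in filtered:
    print(r[0], r[state_i] + r[place_i])
```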

In [72]:
cols = ['NAME', 'Census_population', 'fips_state', 'fips_place']
city_1990 = pd.DataFrame(data[1:], columns=cols)

In [73]:
# Creating Fips_state_place key for joining dataframe

city_1990['fips_state_place'] = city_1990['fips_state']+city_1990['fips_place']

In [74]:
city_1990_first = city_1990[city_1990['fips_state_place'].isin(result_cities)]

In [75]:
city_1990_first.head()

Unnamed: 0,NAME,Census_population,fips_state,fips_place,fips_state_place
24,Auburn city,33830,1,3076,103076
40,Birmingham city,265968,1,7000,107000
112,Dothan city,53589,1,21184,121184
218,Hoover city,39788,1,35896,135896
221,Huntsville city,159789,1,37000,137000


In [76]:
len(city_1990_first)

607

In [77]:
# Creating NYU 500 largest column to know how many NYU cities are covered.
city_1990_first['NYU_500_Largest'] = city_1990_first['fips_state_place'].apply(lambda x:1 if x in nyu_cities else 0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
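The warning above appears because `city_1990_first` was created by filtering another DataFrame, so pandas cannot tell whether the new column is meant to affect the original as well. A common way to avoid it, sketched here on a toy frame rather than the real data, is to take an explicit `.copy()` of the filtered rows before adding columns. The `keep` and `nyu` collections are hypothetical stand-ins for `result_cities` and `nyu_cities`.

```python
import pandas as pd

cities = pd.DataFrame({
    "NAME": ["Auburn city", "Birmingham city", "Akron town"],
    "fips_state_place": ["0103076", "0107000", "0100676"],
})
keep = ["0103076", "0107000"]   # stand-in for result_cities
nyu = ["0107000"]               # stand-in for nyu_cities

# .copy() makes the filtered frame independent of `cities`
subset = cities[cities["fips_state_place"].isin(keep)].copy()

# isin returns booleans; astype(int) turns them into the 1/0 flag
subset["NYU_500_Largest"] = subset["fips_state_place"].isin(nyu).astype(int)

print(subset["NYU_500_Largest"].tolist())  # [0, 1]
```

The flag column is the same as the one built with `apply` and a lambda; only the way the subset is created changes.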
  


In [78]:
city_1990_first['NYU_500_Largest'].sum()

492

We have 607 cities, and 492 of the NYU cities are covered. <br>

For each category, I will follow the same approach that I used for Year 2000.<br>

## Before establishing each category dataset

Each category may have a different format, and some categories require calculations (e.g., age grouping).<br>
Since this is a single-year dataset, I could pull every variable we need in a single API run. However, that would not tell me each variable's format or how many missing values it has, and the calculations become unwieldy with too many columns. So, I will divide the work into 3 parts: <br>

1. Check the variables for each category and choose suitable ones for our dataset. Also, check whether each category requires calculation. For example, the Sex category does not need calculation, but the 65 years and over age group spans several variables that I will have to combine.
2. For variables that do not require calculation, I will pull them together in a single API run.
3. For variables that require calculation, I will pull them separately and calculate the fields we want. Then, I will combine them with my Year 1990 dataset.
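Every API call in this section repeats the same URL construction, so it could be wrapped in a small helper. This is a sketch only: `build_sf1_url` is a hypothetical name and `YOUR_KEY` is a placeholder, but the URL pattern is the one used in the cells of this notebook.

```python
def build_sf1_url(year, cols, api_key, dsource="sf1"):
    """Build the place-level Census API URL in the same form used in this notebook."""
    base_url = f"https://api.census.gov/data/{year}/{dsource}?get={cols}"
    return f"{base_url}&for=place:*&in=state:*&key={api_key}"


# Step 2: variables that need no calculation can share one request
direct_cols = "ANPSADPI,P0010001,P0050001,P0050002"
url = build_sf1_url("1990", direct_cols, "YOUR_KEY")
print(url)
```

Variables that need calculation (step 3) would go through the same helper with their own `cols` string.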

## Sex

For Sex, 1990 SF1 has two main variables. <br>

P0050001 --> Sex Male<br>
P0050002 --> Sex Female <br>

I also want to check the format of these two variables. 



In [79]:
year='1990'
dsource='sf1'
cols = 'ANPSADPI,P0050001,P0050002'
base_url = f'https://api.census.gov/data/{year}/{dsource}?get={cols}'
data_url= f'{base_url}&for=place:*&in=state:*&key={api_key}'
response = requests.get(data_url)
data=response.json()
# Check how many rows the API returned (one header row plus one row per place)
print(year, len(data))
    

1990 23436


In [80]:
data[:5]

[['ANPSADPI', 'P0050001', 'P0050002', 'state', 'place'],
 ['Abbeville city', '1462', '1711', '01', '00124'],
 ['Adamsville city', '2005', '2156', '01', '00460'],
 ['Addison town', '309', '317', '01', '00484'],
 ['Akron town', '214', '254', '01', '00676']]

**Confirmed the format: integer values (returned as strings)**

## Ethnicity

For ethnicity, 1990 SF1 has two main variables. <br>

P0080001 --> Total Persons Hispanic origin <br>
P0090001 --> Hispanic orig(5) No <br>

I want to check P0090001 because it may not be the non-Hispanic population. So, I will pull the total population together with these two variables to check whether they add up to the total population.

In [81]:
year='1990'
dsource='sf1'
cols = 'ANPSADPI,P0010001,P0080001,P0090001'
base_url = f'https://api.census.gov/data/{year}/{dsource}?get={cols}'
data_url= f'{base_url}&for=place:*&in=state:*&key={api_key}'
response = requests.get(data_url)
data=response.json()
# Check how many rows the API returned (one header row plus one row per place)
print(year, len(data))
    




1990 23436


In [82]:
# First five line check
data[:5]

[['ANPSADPI', 'P0010001', 'P0080001', 'P0090001', 'state', 'place'],
 ['Abbeville city', '3173', '15', '3158', '01', '00124'],
 ['Adamsville city', '4161', '12', '4149', '01', '00460'],
 ['Addison town', '626', '0', '626', '01', '00484'],
 ['Akron town', '468', '5', '463', '01', '00676']]

In [83]:
# Random middle check
data[234:240]

[['Killen town', '1047', '0', '1047', '01', '39784'],
 ['Kimberly town', '1096', '0', '1096', '01', '39856'],
 ['Kinsey town', '1679', '12', '1667', '01', '40072'],
 ['Kinston town', '595', '1', '594', '01', '40096'],
 ['Ladonia CDP', '2905', '24', '2881', '01', '40648'],
 ['Lafayette city', '3151', '7', '3144', '01', '40672']]

In [84]:
# Last five line Check
data[-5:]

[['Warren AFB CDP', '3832', '204', '3628', '56', '81640'],
 ['Wheatland town', '3271', '179', '3092', '56', '83040'],
 ['Worland city', '5742', '642', '5100', '56', '84925'],
 ['Wright town', '1236', '51', '1185', '56', '85015'],
 ['Yoder town', '136', '5', '131', '56', '86665']]

Confirmed that P0080001 + P0090001 equals the total population in the rows checked. I will use these two variables for ethnicity. 
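The spot checks above look at a handful of rows. The same check can be applied to every row at once; the sketch below does this on a few of the sample rows shown earlier, using a hypothetical `mismatches` list to collect any place where the two variables do not add up to the total.

```python
# Sample rows copied from the outputs above:
# name, total (P0010001), Hispanic (P0080001), not Hispanic (P0090001), state, place
rows = [
    ["Abbeville city", "3173", "15", "3158", "01", "00124"],
    ["Kinsey town", "1679", "12", "1667", "01", "40072"],
    ["Worland city", "5742", "642", "5100", "56", "84925"],
    ["Yoder town", "136", "5", "131", "56", "86665"],
]

# Collect every row where Hispanic + not Hispanic differs from the total
mismatches = [
    r[0] for r in rows
    if int(r[2]) + int(r[3]) != int(r[1])
]

print(len(mismatches))  # 0 for these sample rows
```

With the full API response, `rows` would be `data[1:]`, so the check covers every place rather than a sample.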

## Single API Query for Sex and Ethnicity
Since I have checked the sex and ethnicity categories, I will run a single API query for both.

In [85]:
year='1990'
dsource='sf1'
cols = 'ANPSADPI,P0010001,P0050001,P0050002,P0090001,P0080001'
base_url = f'https://api.census.gov/data/{year}/{dsource}?get={cols}'
data_url= f'{base_url}&for=place:*&in=state:*&key={api_key}'
response = requests.get(data_url)
data=response.json()
# Check how many rows the API returned (one header row plus one row per place)
print(year, len(data))
    

cols_se= ['NAME', 'census_population', 'pop_male', 'pop_female', \
           'pop_nonhispanic','pop_hispanic','fips_state', 'fips_place']

# Set up each year's dataset



cities_1990 = pd.DataFrame(data[1:], columns=cols_se)
cities_1990["Year"]=1990
# Creating Fips_state_place key for joining dataframe

cities_1990['fips_state_place'] = cities_1990['fips_state']+cities_1990['fips_place']
# Filter only NYU and ACS1 Cities
cities_1990_df = cities_1990[cities_1990['fips_state_place'].isin(result_cities)]


# Creating NYU column

cities_1990_df['NYU_500_Largest'] = cities_1990_df['fips_state_place'].apply(lambda x:1 if x in nyu_cities else 0)

1990 23436


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [86]:
cities_1990_df.head()

Unnamed: 0,NAME,census_population,pop_male,pop_female,pop_nonhispanic,pop_hispanic,fips_state,fips_place,Year,fips_state_place,NYU_500_Largest
24,Auburn city,33830,17143,16687,33516,314,1,3076,1990,103076,0
40,Birmingham city,265968,121437,144531,264930,1038,1,7000,1990,107000,1
112,Dothan city,53589,25154,28435,53230,359,1,21184,1990,121184,0
218,Hoover city,39788,18823,20965,39422,366,1,35896,1990,135896,1
221,Huntsville city,159789,77676,82113,157810,1979,1,37000,1990,137000,1


In [87]:
cities_1990_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 607 entries, 24 to 23341
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   NAME               607 non-null    object
 1   census_population  607 non-null    object
 2   pop_male           607 non-null    object
 3   pop_female         607 non-null    object
 4   pop_nonhispanic    607 non-null    object
 5   pop_hispanic       607 non-null    object
 6   fips_state         607 non-null    object
 7   fips_place         607 non-null    object
 8   Year               607 non-null    int64 
 9   fips_state_place   607 non-null    object
 10  NYU_500_Largest    607 non-null    int64 
dtypes: int64(2), object(9)
memory usage: 56.9+ KB


## Race

In 1990, race is categorized differently. For example, there is no NATIVE HAWAIIAN AND OTHER PACIFIC ISLANDER ALONE race category; Pacific Islanders are grouped with Asians as Asian/Pacific Isldr. <br> 

Solution: 

Even though these races are grouped, SF1 also reports the detailed races. For example, 
Race(25) White   
  Race(25) Black   
  Race(25) American Indian   
  Race(25) Eskimo   
  Race(25) Aleut   
  Race(25) Chinese   
  Race(25) Filipino   
  Race(25) Japanese   
  Race(25) Asian Indian   
  Race(25) Korean   
  Race(25) Vietnamese   
  Race(25) Cambodian   
  Race(25) Hmong   
  Race(25) Laotian   
  Race(25) Thai   
  Race(25) Other Asian   
  Race(25) Hawaiian   
  Race(25) Samoan   
  Race(25) Tongan   
  Race(25) Other Polynesian   
  Race(25) Guamanian   
  Race(25) Other Micronesian   
  Race(25) Melanesian   
  Race(25) Pacific Isldr, n_s_   
  Race(25) Other race   
I can combine these detailed races to create race groups for American Indian and Alaska Native, Asian, and Native Hawaiian and Other Pacific Islander just for Year 1990. 

### Race Combined Chart

The following table shows how I combine these 24 races (excluding Other race) into our columns.


Our Column | Census SF1 Column | SF1 Column Concept
-- | -- | --
White | P0070001 | Race(25) White
Black or African American | P0070002 | Race(25) Black
Asian alone | P0070006<br>P0070007<br>P0070008<br>P0070009<br>P0070010<br>P0070011<br>P0070012<br>P0070013<br>P0070014<br>P0070015<br>P0070016 | Race(25) Chinese<br>Race(25) Filipino<br>Race(25) Japanese<br>Race(25) Asian Indian<br>Race(25) Korean<br>Race(25) Vietnamese<br>Race(25) Cambodian<br>Race(25) Hmong<br>Race(25) Laotian<br>Race(25) Thai<br>Race(25) Other Asian
American Indian and Alaska Native alone | P0070003<br>P0070004<br>P0070005 | Race(25) American Indian<br>Race(25) Eskimo<br>Race(25) Aleut
Native Hawaiian and Other Pacific Islander alone | P0070017<br>P0070018<br>P0070019<br>P0070020<br>P0070021<br>P0070022<br>P0070023<br>P0070024 | Race(25) Hawaiian<br>Race(25) Samoan<br>Race(25) Tongan<br>Race(25) Other Polynesian<br>Race(25) Guamanian<br>Race(25) Other Micronesian<br>Race(25) Melanesian<br>Race(25) Pacific Isldr, n_s_



In [88]:
year='1990'
dsource='sf1'
cols = 'ANPSADPI,P0070001,P0070002,P0070006,P0070007,P0070008,P0070009,P0070010,P0070011,P0070012,P0070013,P0070014,P0070015,P0070016,P0070003,P0070004,P0070005,P0070017,P0070018,P0070019,P0070020,P0070021,P0070022,P0070023,P0070024'
base_url = f'https://api.census.gov/data/{year}/{dsource}?get={cols}'
data_url= f'{base_url}&for=place:*&in=state:*&key={api_key}'
print(data_url)
response = requests.get(data_url)
data=response.json()
# Check how many rows the API returned (one header row plus one row per place)
print(year, len(data))



https://api.census.gov/data/1990/sf1?get=ANPSADPI,P0070001,P0070002,P0070006,P0070007,P0070008,P0070009,P0070010,P0070011,P0070012,P0070013,P0070014,P0070015,P0070016,P0070003,P0070004,P0070005,P0070017,P0070018,P0070019,P0070020,P0070021,P0070022,P0070023,P0070024&for=place:*&in=state:*&key=a45a33bca80b5e1907adc4587be5c346b897dda7
1990 23436


In [89]:
# Check first line
data[0]

['ANPSADPI',
 'P0070001',
 'P0070002',
 'P0070006',
 'P0070007',
 'P0070008',
 'P0070009',
 'P0070010',
 'P0070011',
 'P0070012',
 'P0070013',
 'P0070014',
 'P0070015',
 'P0070016',
 'P0070003',
 'P0070004',
 'P0070005',
 'P0070017',
 'P0070018',
 'P0070019',
 'P0070020',
 'P0070021',
 'P0070022',
 'P0070023',
 'P0070024',
 'state',
 'place']

In [90]:
# Make the first line into column names
cities_1990_race = pd.DataFrame(data[1:], columns=data[0])

In [91]:
cities_1990_race.head()

Unnamed: 0,ANPSADPI,P0070001,P0070002,P0070006,P0070007,P0070008,P0070009,P0070010,P0070011,P0070012,P0070013,P0070014,P0070015,P0070016,P0070003,P0070004,P0070005,P0070017,P0070018,P0070019,P0070020,P0070021,P0070022,P0070023,P0070024,state,place
0,Abbeville city,2039,1115,0,0,0,0,1,0,0,0,0,0,0,6,0,0,0,0,0,0,0,0,0,0,1,124
1,Adamsville city,3493,659,0,0,1,0,2,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,1,460
2,Addison town,624,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,484
3,Akron town,97,371,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,676
4,Alabaster city,13032,1617,2,1,4,7,21,1,0,0,0,0,10,30,0,0,1,0,0,0,1,0,0,0,1,820


In [92]:
cities_1990_race['fips_state_place'] = cities_1990_race['state']+cities_1990_race['place']

# Filter only NYU and ACS1 Cities
cities_1990_race_df = cities_1990_race[cities_1990_race['fips_state_place'].isin(result_cities)]




In [93]:
cities_1990_race_df.head()

Unnamed: 0,ANPSADPI,P0070001,P0070002,P0070006,P0070007,P0070008,P0070009,P0070010,P0070011,P0070012,P0070013,P0070014,P0070015,P0070016,P0070003,P0070004,P0070005,P0070017,P0070018,P0070019,P0070020,P0070021,P0070022,P0070023,P0070024,state,place,fips_state_place
24,Auburn city,27016,5531,493,63,36,297,127,34,1,0,25,16,40,51,3,6,2,1,0,0,3,0,0,0,1,3076,103076
40,Birmingham city,95655,168277,493,92,64,309,128,188,29,0,0,29,121,295,13,13,11,5,0,0,4,0,2,3,1,7000,107000
112,Dothan city,38312,14639,44,24,91,43,21,173,0,0,0,5,15,136,0,0,7,0,0,0,0,0,1,0,1,21184,121184
218,Hoover city,37886,1318,124,26,35,144,70,15,0,0,0,7,50,45,1,1,2,1,0,0,0,0,0,0,1,35896,135896
221,Huntsville city,116065,39016,776,174,261,946,761,166,23,0,53,47,135,800,11,5,48,2,0,2,32,1,0,5,1,37000,137000


In [94]:
# check df columns 

cities_1990_race_df.columns

Index(['ANPSADPI', 'P0070001', 'P0070002', 'P0070006', 'P0070007', 'P0070008',
       'P0070009', 'P0070010', 'P0070011', 'P0070012', 'P0070013', 'P0070014',
       'P0070015', 'P0070016', 'P0070003', 'P0070004', 'P0070005', 'P0070017',
       'P0070018', 'P0070019', 'P0070020', 'P0070021', 'P0070022', 'P0070023',
       'P0070024', 'state', 'place', 'fips_state_place'],
      dtype='object')

In [95]:
# Before Calculation, change column type from string to int.
race_col = ['P0070001', 'P0070002', 'P0070006', 'P0070007', 'P0070008',
       'P0070009', 'P0070010', 'P0070011', 'P0070012', 'P0070013', 'P0070014',
       'P0070015', 'P0070016', 'P0070003', 'P0070004', 'P0070005', 'P0070017',
       'P0070018', 'P0070019', 'P0070020', 'P0070021', 'P0070022', 'P0070023',
       'P0070024']


for col in race_col:
    cities_1990_race_df[col]=cities_1990_race_df[col].astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # Remove the CWD from sys.path while we load stuff.


In [96]:
cities_1990_race_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 607 entries, 24 to 23341
Data columns (total 28 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   ANPSADPI          607 non-null    object
 1   P0070001          607 non-null    int64 
 2   P0070002          607 non-null    int64 
 3   P0070006          607 non-null    int64 
 4   P0070007          607 non-null    int64 
 5   P0070008          607 non-null    int64 
 6   P0070009          607 non-null    int64 
 7   P0070010          607 non-null    int64 
 8   P0070011          607 non-null    int64 
 9   P0070012          607 non-null    int64 
 10  P0070013          607 non-null    int64 
 11  P0070014          607 non-null    int64 
 12  P0070015          607 non-null    int64 
 13  P0070016          607 non-null    int64 
 14  P0070003          607 non-null    int64 
 15  P0070004          607 non-null    int64 
 16  P0070005          607 non-null    int64 
 17  P0070017     

In [97]:
# White and Black need no calculation, so I only rename the columns.


cities_1990_race_df.rename(columns={'P0070001': 'pop_white', 'P0070002':'pop_black',
                                   'ANPSADPI':'NAME' , 'state': 'fips_state', 'place':'fips_place'}, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [98]:
cities_1990_race_df.tail()

Unnamed: 0,NAME,pop_white,pop_black,P0070006,P0070007,P0070008,P0070009,P0070010,P0070011,P0070012,P0070013,P0070014,P0070015,P0070016,P0070003,P0070004,P0070005,P0070017,P0070018,P0070019,P0070020,P0070021,P0070022,P0070023,P0070024,fips_state,fips_place,fips_state_place
23057,Milwaukee city,398033,191255,1416,729,416,1551,642,839,94,3330,1779,100,739,5797,17,44,64,27,0,1,69,1,6,14,55,53000,5553000
23122,Oshkosh city,52945,435,67,16,29,132,53,25,1,727,93,8,46,254,4,15,2,0,0,0,1,0,1,0,55,60500,5560500
23163,Racine city,64378,15551,59,36,36,104,77,75,0,0,16,10,40,265,3,5,2,0,0,1,2,0,0,0,55,66000,5566000
23279,Waukesha city,54319,317,125,75,42,143,47,46,0,2,148,18,49,160,1,0,6,1,0,3,11,0,1,2,55,84250,5584250
23341,Cheyenne city,44814,1561,54,129,139,31,109,22,0,0,0,34,16,349,1,1,24,11,0,0,12,2,0,1,56,13900,5613900


In [99]:
# Checking Asian Columns
cities_1990_race_df.iloc[:, 3:14].head()

Unnamed: 0,P0070006,P0070007,P0070008,P0070009,P0070010,P0070011,P0070012,P0070013,P0070014,P0070015,P0070016
24,493,63,36,297,127,34,1,0,25,16,40
40,493,92,64,309,128,188,29,0,0,29,121
112,44,24,91,43,21,173,0,0,0,5,15
218,124,26,35,144,70,15,0,0,0,7,50
221,776,174,261,946,761,166,23,0,53,47,135


In [100]:
# Calculation for Asian
# The Asian variables were requested together in the API call, so their columns are adjacent and can be summed by column position


cities_1990_race_df['pop_asian'] = cities_1990_race_df.iloc[:, 3:14].sum(axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """


In [101]:
cities_1990_race_df.head()

Unnamed: 0,NAME,pop_white,pop_black,P0070006,P0070007,P0070008,P0070009,P0070010,P0070011,P0070012,P0070013,P0070014,P0070015,P0070016,P0070003,P0070004,P0070005,P0070017,P0070018,P0070019,P0070020,P0070021,P0070022,P0070023,P0070024,fips_state,fips_place,fips_state_place,pop_asian
24,Auburn city,27016,5531,493,63,36,297,127,34,1,0,25,16,40,51,3,6,2,1,0,0,3,0,0,0,1,3076,103076,1132
40,Birmingham city,95655,168277,493,92,64,309,128,188,29,0,0,29,121,295,13,13,11,5,0,0,4,0,2,3,1,7000,107000,1453
112,Dothan city,38312,14639,44,24,91,43,21,173,0,0,0,5,15,136,0,0,7,0,0,0,0,0,1,0,1,21184,121184,416
218,Hoover city,37886,1318,124,26,35,144,70,15,0,0,0,7,50,45,1,1,2,1,0,0,0,0,0,0,1,35896,135896,471
221,Huntsville city,116065,39016,776,174,261,946,761,166,23,0,53,47,135,800,11,5,48,2,0,2,32,1,0,5,1,37000,137000,3342


In [102]:
# Check American Indian and Alaska Native

cities_1990_race_df.iloc[:, 14:17].head()

Unnamed: 0,P0070003,P0070004,P0070005
24,51,3,6
40,295,13,13
112,136,0,0
218,45,1,1
221,800,11,5


In [103]:
# Check Native Hawaiian and Other Pacific Islander

cities_1990_race_df.iloc[:, 17:25].head()

Unnamed: 0,P0070017,P0070018,P0070019,P0070020,P0070021,P0070022,P0070023,P0070024
24,2,1,0,0,3,0,0,0
40,11,5,0,0,4,0,2,3
112,7,0,0,0,0,0,1,0
218,2,1,0,0,0,0,0,0
221,48,2,0,2,32,1,0,5


In [104]:
# Calculation for American Indian and Alaska Native

cities_1990_race_df['pop_AIAN'] = cities_1990_race_df.iloc[:, 14:17].sum(axis=1)


# Calculation for Native Hawaiian and Other Pacific Islander

cities_1990_race_df['pop_NHOPI'] = cities_1990_race_df.iloc[:, 17:25].sum(axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [105]:
cities_1990_race_df.head()

Unnamed: 0,NAME,pop_white,pop_black,P0070006,P0070007,P0070008,P0070009,P0070010,P0070011,P0070012,P0070013,P0070014,P0070015,P0070016,P0070003,P0070004,P0070005,P0070017,P0070018,P0070019,P0070020,P0070021,P0070022,P0070023,P0070024,fips_state,fips_place,fips_state_place,pop_asian,pop_AIAN,pop_NHOPI
24,Auburn city,27016,5531,493,63,36,297,127,34,1,0,25,16,40,51,3,6,2,1,0,0,3,0,0,0,1,3076,103076,1132,60,6
40,Birmingham city,95655,168277,493,92,64,309,128,188,29,0,0,29,121,295,13,13,11,5,0,0,4,0,2,3,1,7000,107000,1453,321,25
112,Dothan city,38312,14639,44,24,91,43,21,173,0,0,0,5,15,136,0,0,7,0,0,0,0,0,1,0,1,21184,121184,416,136,8
218,Hoover city,37886,1318,124,26,35,144,70,15,0,0,0,7,50,45,1,1,2,1,0,0,0,0,0,0,1,35896,135896,471,47,3
221,Huntsville city,116065,39016,776,174,261,946,761,166,23,0,53,47,135,800,11,5,48,2,0,2,32,1,0,5,1,37000,137000,3342,816,90
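The group totals above are computed by column position (`iloc[:, 3:14]` and so on), which depends on the order in which the variables were requested. As a hedged alternative, the groups can be selected by variable code instead. The sketch below uses plain Python on the Auburn city values shown above; `group_total` is a hypothetical helper, not part of the notebook.

```python
# Variable codes for each combined group, following the Race Combined Chart
asian_cols = [f"P00700{n:02d}" for n in range(6, 17)]   # P0070006..P0070016
aian_cols = [f"P00700{n:02d}" for n in range(3, 6)]     # P0070003..P0070005
nhopi_cols = [f"P00700{n:02d}" for n in range(17, 25)]  # P0070017..P0070024

# Auburn city values from the output above, in the same order as the codes
values = [493, 63, 36, 297, 127, 34, 1, 0, 25, 16, 40,   # Asian detail
          51, 3, 6,                                       # AIAN detail
          2, 1, 0, 0, 3, 0, 0, 0]                         # NHOPI detail
auburn = dict(zip(asian_cols + aian_cols + nhopi_cols, values))


def group_total(row, cols):
    """Sum the detailed race counts that make up one combined group."""
    return sum(row[c] for c in cols)


print(group_total(auburn, asian_cols),
      group_total(auburn, aian_cols),
      group_total(auburn, nhopi_cols))  # 1132 60 6
```

On the DataFrame itself, the equivalent selection by name would be `cities_1990_race_df[asian_cols].sum(axis=1)`, which gives the same totals regardless of column order.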


In [106]:
# Take the columns that we need for final df.

cities_1990_race_df = cities_1990_race_df[['NAME','pop_white','pop_black', 'pop_asian','pop_AIAN','pop_NHOPI','fips_state_place']]

In [107]:
cities_1990_race_df.head()

Unnamed: 0,NAME,pop_white,pop_black,pop_asian,pop_AIAN,pop_NHOPI,fips_state_place
24,Auburn city,27016,5531,1132,60,6,103076
40,Birmingham city,95655,168277,1453,321,25,107000
112,Dothan city,38312,14639,416,136,8,121184
218,Hoover city,37886,1318,471,47,3,135896
221,Huntsville city,116065,39016,3342,816,90,137000


In [108]:
# Merge two dfs into one

cities_1990_df2 = pd.merge(cities_1990_df, cities_1990_race_df, on=['NAME', 'fips_state_place'])

In [109]:
cities_1990_df2.head()

Unnamed: 0,NAME,census_population,pop_male,pop_female,pop_nonhispanic,pop_hispanic,fips_state,fips_place,Year,fips_state_place,NYU_500_Largest,pop_white,pop_black,pop_asian,pop_AIAN,pop_NHOPI
0,Auburn city,33830,17143,16687,33516,314,1,3076,1990,103076,0,27016,5531,1132,60,6
1,Birmingham city,265968,121437,144531,264930,1038,1,7000,1990,107000,1,95655,168277,1453,321,25
2,Dothan city,53589,25154,28435,53230,359,1,21184,1990,121184,0,38312,14639,416,136,8
3,Hoover city,39788,18823,20965,39422,366,1,35896,1990,135896,1,37886,1318,471,47,3
4,Huntsville city,159789,77676,82113,157810,1979,1,37000,1990,137000,1,116065,39016,3342,816,90





## Age Group in 1990

For 1990, SF1 has no pre-grouped age variables, so we need to add individual age brackets together to get the age groups that we want. 
<br>
The table below lists the columns that I will pull from the Census API and how I combine them into our columns. 
<br>


Our Column | Census SF1 Columns | SF1 Column Concepts | Formula
-- | -- | -- | --
15 to 44 years | P0110010, P0110011, P0110012, P0110013, P0110014, P0110015, P0110016, P0110017, P0110018, P0110019, P0110020, P0110021 | Persons Age 15 yrs, 16 yrs, 17 yrs, 18 yrs, 19 yrs, 20 yrs, 21 yrs, 22-24 yrs, 25-29 yrs, 30-34 yrs, 35-39 yrs, 40-44 yrs | 100 * (sum of these columns / total population)
18 years and over | P0110013, P0110014, P0110015, P0110016, P0110017, P0110018, P0110019, P0110020, P0110021, P0110022, P0110023, P0110024, P0110025, P0110026, P0110027, P0110028, P0110029, P0110030, P0110031 | Persons Age 18 yrs, 19 yrs, 20 yrs, 21 yrs, 22-24 yrs, 25-29 yrs, 30-34 yrs, 35-39 yrs, 40-44 yrs, 45-49 yrs, 50-54 yrs, 55-59 yrs, 60-61 yrs, 62-64 yrs, 65-69 yrs, 70-74 yrs, 75-79 yrs, 80-84 yrs, 85+ yrs | 100 * (sum of these columns / total population)
65 years and over | P0110027, P0110028, P0110029, P0110030, P0110031 | Persons Age 65-69 yrs, 70-74 yrs, 75-79 yrs, 80-84 yrs, 85+ yrs | 100 * (sum of these columns / total population)
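
The formula in the table can be sketched with name-based column selection, which does not depend on column order. This is only an illustration: `age_vars`, `GROUPS`, and `add_age_shares` are hypothetical names that do not appear elsewhere in this notebook, and the single test row reuses the Auburn city counts shown further down.

```python
import pandas as pd

def age_vars(first, last):
    """Return 1990 SF1 names P01100<first> .. P01100<last>, inclusive (hypothetical helper)."""
    return [f"P01100{n}" for n in range(first, last + 1)]

# Our column -> SF1 columns to sum, following the table above
GROUPS = {
    "15 to 44 years": age_vars(10, 21),
    "18 years and over": age_vars(13, 31),
    "65 years and over": age_vars(27, 31),
}

def add_age_shares(df, total_col="P0010001"):
    """Add one percentage column per age group: 100 * (sum of columns / total population)."""
    out = df.copy()
    for label, cols in GROUPS.items():
        out[label] = (100 * out[cols].sum(axis=1) / out[total_col]).round(2)
    return out

# Auburn city counts for P0110010 .. P0110031, copied from this notebook's output
counts = [248, 235, 269, 1565, 3219, 3328, 3233, 4791, 2631, 1914, 1623,
          1343, 1087, 808, 737, 247, 394, 616, 468, 382, 250, 191]
auburn = pd.DataFrame([[33830] + counts], columns=["P0010001"] + age_vars(10, 31))

result = add_age_shares(auburn)
print(result[list(GROUPS)])
```

For this row the three shares agree with the 72.12, 85.21, and 5.64 reported for Auburn city later in the notebook.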


In [110]:
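# Build the variable list: P0110010 through P0110031 cover ages 15 to 85+
lst = []
for num in range(10, 32):
    lst.append('P01100{}'.format(num))

# ANPSADPI is the area (place) name and P0010001 is the total population
cols = "ANPSADPI,P0010001,"+",".join(lst)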


In [111]:
cols

'ANPSADPI,P0010001,P0110010,P0110011,P0110012,P0110013,P0110014,P0110015,P0110016,P0110017,P0110018,P0110019,P0110020,P0110021,P0110022,P0110023,P0110024,P0110025,P0110026,P0110027,P0110028,P0110029,P0110030,P0110031'

In [112]:
year='1990'
dsource='sf1'

base_url = f'https://api.census.gov/data/{year}/{dsource}?get={cols}'
data_url= f'{base_url}&for=place:*&in=state:*&key={api_key}'
print(data_url)
response = requests.get(data_url)
data=response.json()
# Check how many rows the API returned (the first row is the header)
print(year, len(data))

# Use the first row as the column names and the remaining rows as data
cities_1990_age = pd.DataFrame(data[1:], columns=data[0])

https://api.census.gov/data/1990/sf1?get=ANPSADPI,P0010001,P0110010,P0110011,P0110012,P0110013,P0110014,P0110015,P0110016,P0110017,P0110018,P0110019,P0110020,P0110021,P0110022,P0110023,P0110024,P0110025,P0110026,P0110027,P0110028,P0110029,P0110030,P0110031&for=place:*&in=state:*&key=a45a33bca80b5e1907adc4587be5c346b897dda7
1990 23436
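
The response is a list of lists whose first row holds the column names, and every value arrives as a string. A minimal sketch of converting such a payload follows, assuming a hypothetical helper `census_rows_to_frame` and two illustrative rows shaped like the API output. With `requests`, calling `response.raise_for_status()` before `response.json()` is one way to surface HTTP errors early.

```python
import pandas as pd

def census_rows_to_frame(payload, numeric_cols=()):
    """Hypothetical helper: treat row 0 as the header and the rest as records."""
    if not payload or len(payload) < 2:
        raise ValueError("expected a header row plus at least one data row")
    header, *rows = payload
    frame = pd.DataFrame(rows, columns=header)
    # The API returns counts as strings, so convert the requested columns
    for col in numeric_cols:
        frame[col] = pd.to_numeric(frame[col])
    return frame

# Illustrative payload in the same shape as the API response
sample = [
    ["ANPSADPI", "P0010001", "state", "place"],
    ["Auburn city", "33830", "01", "03076"],
    ["Birmingham city", "265968", "01", "07000"],
]

frame = census_rows_to_frame(sample, numeric_cols=["P0010001"])
print(frame.dtypes)
```

The FIPS columns are left as strings on purpose so that leading zeros survive.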


In [113]:
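# Build the combined state + place FIPS key
cities_1990_age['fips_state_place'] = cities_1990_age['state']+cities_1990_age['place']

# Keep only the NYU and ACS1 cities; .copy() avoids SettingWithCopyWarning on later assignments
cities_1990_age_df = cities_1990_age[cities_1990_age['fips_state_place'].isin(result_cities)].copy()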
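
The filter only matches when both sides use the same key format. The sketch below assumes, without confirmation from this notebook, that `result_cities` holds zero-padded 7-character strings. `make_place_key` and `wanted` are hypothetical names used for illustration.

```python
import pandas as pd

def make_place_key(state, place):
    """Zero-pad state to 2 digits and place to 5 digits, then concatenate (hypothetical helper)."""
    return state.astype(str).str.zfill(2) + place.astype(str).str.zfill(5)

# Toy input; the second row has lost its leading zeros on purpose
places = pd.DataFrame({
    "state": ["01", "1", "01"],
    "place": ["03076", "7000", "01660"],
})
places["fips_state_place"] = make_place_key(places["state"], places["place"])

# Illustrative stand-in for result_cities
wanted = {"0103076", "0107000"}

# .copy() gives an independent frame, so later column assignments do not warn
subset = places[places["fips_state_place"].isin(wanted)].copy()
print(subset)
```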

In [114]:
cities_1990_age_df.head()

Unnamed: 0,ANPSADPI,P0010001,P0110010,P0110011,P0110012,P0110013,P0110014,P0110015,P0110016,P0110017,P0110018,P0110019,P0110020,P0110021,P0110022,P0110023,P0110024,P0110025,P0110026,P0110027,P0110028,P0110029,P0110030,P0110031,state,place,fips_state_place
24,Auburn city,33830,248,235,269,1565,3219,3328,3233,4791,2631,1914,1623,1343,1087,808,737,247,394,616,468,382,250,191,1,3076,103076
40,Birmingham city,265968,3490,3394,3592,3565,4196,4182,4078,12707,23565,23480,21138,16505,11816,10744,11023,4747,7490,11975,9552,7965,5536,4452,1,7000,107000
112,Dothan city,53589,853,807,847,758,795,693,689,2033,4090,4477,4402,3787,3137,2490,2302,922,1486,2205,1819,1296,864,597,1,21184,121184
218,Hoover city,39788,457,483,494,437,440,478,457,1856,3707,3615,3698,3618,2689,2053,1714,764,1026,1543,988,690,390,307,1,35896,135896
221,Huntsville city,159789,1914,2009,2153,2776,3034,2914,2595,8378,15620,14219,11865,11296,9529,9186,8470,2807,4071,6026,4077,2756,1772,1351,1,37000,137000


In [115]:
cities_1990_age_df.columns

Index(['ANPSADPI', 'P0010001', 'P0110010', 'P0110011', 'P0110012', 'P0110013',
       'P0110014', 'P0110015', 'P0110016', 'P0110017', 'P0110018', 'P0110019',
       'P0110020', 'P0110021', 'P0110022', 'P0110023', 'P0110024', 'P0110025',
       'P0110026', 'P0110027', 'P0110028', 'P0110029', 'P0110030', 'P0110031',
       'state', 'place', 'fips_state_place'],
      dtype='object')

In [116]:
# Before calculating, convert the count columns from string to int.
age_col = ['P0010001', 'P0110010', 'P0110011', 'P0110012', 'P0110013',
       'P0110014', 'P0110015', 'P0110016', 'P0110017', 'P0110018', 'P0110019',
       'P0110020', 'P0110021', 'P0110022', 'P0110023', 'P0110024', 'P0110025',
       'P0110026', 'P0110027', 'P0110028', 'P0110029', 'P0110030', 'P0110031']


for col in age_col:
    cities_1990_age_df[col]=cities_1990_age_df[col].astype(int)
    

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':


In [117]:
cities_1990_age_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 607 entries, 24 to 23341
Data columns (total 27 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   ANPSADPI          607 non-null    object
 1   P0010001          607 non-null    int64 
 2   P0110010          607 non-null    int64 
 3   P0110011          607 non-null    int64 
 4   P0110012          607 non-null    int64 
 5   P0110013          607 non-null    int64 
 6   P0110014          607 non-null    int64 
 7   P0110015          607 non-null    int64 
 8   P0110016          607 non-null    int64 
 9   P0110017          607 non-null    int64 
 10  P0110018          607 non-null    int64 
 11  P0110019          607 non-null    int64 
 12  P0110020          607 non-null    int64 
 13  P0110021          607 non-null    int64 
 14  P0110022          607 non-null    int64 
 15  P0110023          607 non-null    int64 
 16  P0110024          607 non-null    int64 
 17  P0110025     

In [118]:
# Checking columns for 15-44 

cities_1990_age_df.iloc[:, 2:14].head()

Unnamed: 0,P0110010,P0110011,P0110012,P0110013,P0110014,P0110015,P0110016,P0110017,P0110018,P0110019,P0110020,P0110021
24,248,235,269,1565,3219,3328,3233,4791,2631,1914,1623,1343
40,3490,3394,3592,3565,4196,4182,4078,12707,23565,23480,21138,16505
112,853,807,847,758,795,693,689,2033,4090,4477,4402,3787
218,457,483,494,437,440,478,457,1856,3707,3615,3698,3618
221,1914,2009,2153,2776,3034,2914,2595,8378,15620,14219,11865,11296


In [119]:
# Checking columns for 18 +

cities_1990_age_df.iloc[:, 5:24].head()

Unnamed: 0,P0110013,P0110014,P0110015,P0110016,P0110017,P0110018,P0110019,P0110020,P0110021,P0110022,P0110023,P0110024,P0110025,P0110026,P0110027,P0110028,P0110029,P0110030,P0110031
24,1565,3219,3328,3233,4791,2631,1914,1623,1343,1087,808,737,247,394,616,468,382,250,191
40,3565,4196,4182,4078,12707,23565,23480,21138,16505,11816,10744,11023,4747,7490,11975,9552,7965,5536,4452
112,758,795,693,689,2033,4090,4477,4402,3787,3137,2490,2302,922,1486,2205,1819,1296,864,597
218,437,440,478,457,1856,3707,3615,3698,3618,2689,2053,1714,764,1026,1543,988,690,390,307
221,2776,3034,2914,2595,8378,15620,14219,11865,11296,9529,9186,8470,2807,4071,6026,4077,2756,1772,1351


In [120]:
# checking columns for 65+
cities_1990_age_df.iloc[:, 19:24].head()

Unnamed: 0,P0110027,P0110028,P0110029,P0110030,P0110031
24,616,468,382,250,191
40,11975,9552,7965,5536,4452
112,2205,1819,1296,864,597
218,1543,988,690,390,307
221,6026,4077,2756,1772,1351


In [121]:

# Total population is the denominator for every share
total_pop = cities_1990_age_df['P0010001']

# 15 to 44 years: columns 2-13 (P0110010 to P0110021), as a percentage with 2 decimals
cities_1990_age_df['15 to 44 years'] = (100*cities_1990_age_df.iloc[:, 2:14].sum(axis=1)/total_pop).round(2)

# 18 years and over: columns 5-23 (P0110013 to P0110031)
cities_1990_age_df['18 years and over'] = (100*cities_1990_age_df.iloc[:, 5:24].sum(axis=1)/total_pop).round(2)

# 65 years and over: columns 19-23 (P0110027 to P0110031)
cities_1990_age_df['65 years and over'] = (100*cities_1990_age_df.iloc[:, 19:24].sum(axis=1)/total_pop).round(2)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [122]:
cities_1990_age_df.head()

Unnamed: 0,ANPSADPI,P0010001,P0110010,P0110011,P0110012,P0110013,P0110014,P0110015,P0110016,P0110017,P0110018,P0110019,P0110020,P0110021,P0110022,P0110023,P0110024,P0110025,P0110026,P0110027,P0110028,P0110029,P0110030,P0110031,state,place,fips_state_place,15 to 44 years,18 years and over,65 years and over
24,Auburn city,33830,248,235,269,1565,3219,3328,3233,4791,2631,1914,1623,1343,1087,808,737,247,394,616,468,382,250,191,1,3076,103076,72.12,85.21,5.64
40,Birmingham city,265968,3490,3394,3592,3565,4196,4182,4078,12707,23565,23480,21138,16505,11816,10744,11023,4747,7490,11975,9552,7965,5536,4452,1,7000,107000,46.58,74.71,14.84
112,Dothan city,53589,853,807,847,758,795,693,689,2033,4090,4477,4402,3787,3137,2490,2302,922,1486,2205,1819,1296,864,597,1,21184,121184,45.22,72.48,12.65
218,Hoover city,39788,457,483,494,437,440,478,457,1856,3707,3615,3698,3618,2689,2053,1714,764,1026,1543,988,690,390,307,1,35896,135896,49.61,76.58,9.85
221,Huntsville city,159789,1914,2009,2153,2776,3034,2914,2595,8378,15620,14219,11865,11296,9529,9186,8470,2807,4071,6026,4077,2756,1772,1351,1,37000,137000,49.3,76.82,10.0


In [123]:
# filter only necessary columns

cities_1990_age_df= cities_1990_age_df[['fips_state_place','15 to 44 years','18 years and over','65 years and over']]

In [124]:
# Merge the population table with the age shares on the combined FIPS key

cities_1990_df3 = pd.merge(cities_1990_df2,cities_1990_age_df, on= 'fips_state_place')
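
A merge like this silently drops rows whose key is missing on either side, and duplicates rows when a key repeats. The sketch below shows one way to check both, using the `validate` and `indicator` parameters of `pd.merge` on two toy frames. The frames `left` and `right` are illustrative and reuse two rows from this notebook.

```python
import pandas as pd

left = pd.DataFrame({
    "fips_state_place": ["0103076", "0107000"],
    "census_population": [33830, 265968],
})
right = pd.DataFrame({
    "fips_state_place": ["0103076", "0107000"],
    "65 years and over": [5.64, 14.84],
})

# validate raises MergeError if a key repeats on either side;
# indicator adds a _merge column telling which side each row came from
merged = pd.merge(left, right, on="fips_state_place",
                  how="outer", validate="one_to_one", indicator=True)

# Rows present in only one of the two inputs
unmatched = merged[merged["_merge"] != "both"]
print(len(merged), len(unmatched))
```

An empty `unmatched` frame means the default inner join would have kept every row.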

In [125]:
cities_1990_df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 607 entries, 0 to 606
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   NAME               607 non-null    object 
 1   census_population  607 non-null    object 
 2   pop_male           607 non-null    object 
 3   pop_female         607 non-null    object 
 4   pop_nonhispanic    607 non-null    object 
 5   pop_hispanic       607 non-null    object 
 6   fips_state         607 non-null    object 
 7   fips_place         607 non-null    object 
 8   Year               607 non-null    int64  
 9   fips_state_place   607 non-null    object 
 10  NYU_500_Largest    607 non-null    int64  
 11  pop_white          607 non-null    int64  
 12  pop_black          607 non-null    int64  
 13  pop_asian          607 non-null    int64  
 14  pop_AIAN           607 non-null    int64  
 15  pop_NHOPI          607 non-null    int64  
 16  15 to 44 years     607 non

In [126]:
# Convert the numeric columns from string to int
num_cols = ['census_population', 'pop_male', 'pop_female', 'pop_nonhispanic', 'pop_hispanic']
cities_1990_df3[num_cols] = cities_1990_df3[num_cols].astype(int)

In [127]:
cities_1990_df3.head()

Unnamed: 0,NAME,census_population,pop_male,pop_female,pop_nonhispanic,pop_hispanic,fips_state,fips_place,Year,fips_state_place,NYU_500_Largest,pop_white,pop_black,pop_asian,pop_AIAN,pop_NHOPI,15 to 44 years,18 years and over,65 years and over
0,Auburn city,33830,17143,16687,33516,314,1,3076,1990,103076,0,27016,5531,1132,60,6,72.12,85.21,5.64
1,Birmingham city,265968,121437,144531,264930,1038,1,7000,1990,107000,1,95655,168277,1453,321,25,46.58,74.71,14.84
2,Dothan city,53589,25154,28435,53230,359,1,21184,1990,121184,0,38312,14639,416,136,8,45.22,72.48,12.65
3,Hoover city,39788,18823,20965,39422,366,1,35896,1990,135896,1,37886,1318,471,47,3,49.61,76.58,9.85
4,Huntsville city,159789,77676,82113,157810,1979,1,37000,1990,137000,1,116065,39016,3342,816,90,49.3,76.82,10.0


In [128]:
cities_1990_df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 607 entries, 0 to 606
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   NAME               607 non-null    object 
 1   census_population  607 non-null    int64  
 2   pop_male           607 non-null    int64  
 3   pop_female         607 non-null    int64  
 4   pop_nonhispanic    607 non-null    int64  
 5   pop_hispanic       607 non-null    int64  
 6   fips_state         607 non-null    object 
 7   fips_place         607 non-null    object 
 8   Year               607 non-null    int64  
 9   fips_state_place   607 non-null    object 
 10  NYU_500_Largest    607 non-null    int64  
 11  pop_white          607 non-null    int64  
 12  pop_black          607 non-null    int64  
 13  pop_asian          607 non-null    int64  
 14  pop_AIAN           607 non-null    int64  
 15  pop_NHOPI          607 non-null    int64  
 16  15 to 44 years     607 non

In [129]:
# Add a new column named 'st_name' 
cities_1990_df3['st_name'] = cities_1990_df3['fips_state'].apply(set_value, args =(state_dic1, ))


cities_1990_df3.head()

Unnamed: 0,NAME,census_population,pop_male,pop_female,pop_nonhispanic,pop_hispanic,fips_state,fips_place,Year,fips_state_place,NYU_500_Largest,pop_white,pop_black,pop_asian,pop_AIAN,pop_NHOPI,15 to 44 years,18 years and over,65 years and over,st_name
0,Auburn city,33830,17143,16687,33516,314,1,3076,1990,103076,0,27016,5531,1132,60,6,72.12,85.21,5.64,Alabama
1,Birmingham city,265968,121437,144531,264930,1038,1,7000,1990,107000,1,95655,168277,1453,321,25,46.58,74.71,14.84,Alabama
2,Dothan city,53589,25154,28435,53230,359,1,21184,1990,121184,0,38312,14639,416,136,8,45.22,72.48,12.65,Alabama
3,Hoover city,39788,18823,20965,39422,366,1,35896,1990,135896,1,37886,1318,471,47,3,49.61,76.58,9.85,Alabama
4,Huntsville city,159789,77676,82113,157810,1979,1,37000,1990,137000,1,116065,39016,3342,816,90,49.3,76.82,10.0,Alabama


In [130]:
# Rename NAME to our column name, city_name

cities_1990_df3.rename(columns={"NAME": 'city_name'}, inplace=True)

In [131]:

# Re-order the columns
cities_1990_df3 = cities_1990_df3[['fips_state', 'fips_place', 'fips_state_place', 'city_name', 'st_name',
                                   'NYU_500_Largest', 'Year', 'census_population',
                                   '15 to 44 years', '18 years and over', '65 years and over',
                                   'pop_male', 'pop_female',
                                   'pop_white', 'pop_black', 'pop_asian', 'pop_AIAN', 'pop_NHOPI',
                                   'pop_nonhispanic', 'pop_hispanic']]

# Create CSV and Excel File
cities_1990_df3.to_csv("place_pop_1990_asrh.csv", index=False)
cities_1990_df3.to_excel("place_pop_1990_asrh.xlsx", index=False)

In [132]:

# Combine the 1990 rows with the 2000-2018 dataset
df_1990_2018 = pd.concat([cities_1990_df3,df_2000_2018])

df_1990_2018.head()

Unnamed: 0,fips_state,fips_place,fips_state_place,city_name,st_name,NYU_500_Largest,Year,census_population,15 to 44 years,18 years and over,65 years and over,pop_male,pop_female,pop_white,pop_black,pop_asian,pop_AIAN,pop_NHOPI,pop_nonhispanic,pop_hispanic
0,1,3076,103076,Auburn city,Alabama,0,1990,33830,72.12,85.21,5.64,17143,16687,27016,5531,1132,60,6,33516,314
1,1,7000,107000,Birmingham city,Alabama,1,1990,265968,46.58,74.71,14.84,121437,144531,95655,168277,1453,321,25,264930,1038
2,1,21184,121184,Dothan city,Alabama,0,1990,53589,45.22,72.48,12.65,25154,28435,38312,14639,416,136,8,53230,359
3,1,35896,135896,Hoover city,Alabama,1,1990,39788,49.61,76.58,9.85,18823,20965,37886,1318,471,47,3,39422,366
4,1,37000,137000,Huntsville city,Alabama,1,1990,159789,49.3,76.82,10.0,77676,82113,116065,39016,3342,816,90,157810,1979
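
Because `pd.concat` keeps each input's index by default, index labels can repeat in the combined frame (the `info()` output below reports 9137 entries with labels running only from 0 to 7904). A minimal sketch of renumbering the rows and checking that each city-year pair appears once follows. The two toy frames and their values are made up for illustration.

```python
import pandas as pd

# Toy frames with made-up values, standing in for the 1990 and 2000-2018 tables
df_1990 = pd.DataFrame({
    "fips_state_place": ["A"],
    "Year": [1990],
    "census_population": [100],
})
df_later = pd.DataFrame({
    "fips_state_place": ["A", "A"],
    "Year": [2000, 2010],
    "census_population": [110, 120],
})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating labels
combined = pd.concat([df_1990, df_later], ignore_index=True)

# Flag any city-year pair that appears more than once
dupes = combined.duplicated(subset=["fips_state_place", "Year"])
print(combined.index.is_unique, int(dupes.sum()))
```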


In [133]:
df_1990_2018.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9137 entries, 0 to 7904
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   fips_state         9137 non-null   object 
 1   fips_place         9137 non-null   object 
 2   fips_state_place   9137 non-null   object 
 3   city_name          9137 non-null   object 
 4   st_name            9137 non-null   object 
 5   NYU_500_Largest    9137 non-null   int64  
 6   Year               9137 non-null   int64  
 7   census_population  9137 non-null   int64  
 8   15 to 44 years     9137 non-null   float64
 9   18 years and over  9137 non-null   float64
 10  65 years and over  9137 non-null   float64
 11  pop_male           9137 non-null   int64  
 12  pop_female         9137 non-null   int64  
 13  pop_white          9137 non-null   int64  
 14  pop_black          9137 non-null   int64  
 15  pop_asian          9137 non-null   int64  
 16  pop_AIAN           9137 

In [134]:
# Save the combined 1990-2018 dataset as CSV and Excel files
df_1990_2018.to_csv("place_pop_all_year_asrh.csv", index=False)
df_1990_2018.to_excel("place_pop_all_year_asrh.xlsx", index=False)