# Some other race

Predictions:

1. The upcoming proposal for changing the census essentially adds Hispanic as a race (because the separate ethnicity question is being eliminated).
2. A Middle Eastern/North African category is being added.

This means that the categories (minimum categories) will be:
1. American Indian/Alaska Native
2. Asian
3. Black/African American
4. Hispanic/Latino
5. Middle Eastern/North African
6. Native Hawaiian / Pacific Islander
7. White

What would be good for predictions?
1. Some maps--looking at the ACS data to see where people fall under Hispanic and thus predict a shift in responses for 2030 ethnicity
2. Looking specifically at country level to determine how adding Middle Eastern/North African will impact the responses as well. 

# Importing

In [None]:
import json

import geopandas as gpd
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import requests
import us

from census import Census
from shapely.geometry import Point

import plotly.express as px

from urllib.request import urlopen
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)

In [None]:
census = Census("", year=2020)

# Read files

When pandas reads these CSV files, it parses the GEOID column, which should be a string of digits, as an integer. This is a problem because some FIPS codes begin with 0. After reading the files, we therefore convert the GEOID to a string and prepend a 0 where necessary (this only affects states whose FIPS code is below 10, Alabama through Connecticut).
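One way to avoid the repair step is to ask pandas to read the column as text in the first place, or to pad it afterwards with `str.zfill`. The sketch below uses a small in-memory CSV with made-up rows. The column name `GEOID` matches the files here; the values and the `white_pct` column are illustrative only.

```python
import io

import pandas as pd

# Made-up sample rows; the real files are much larger.
csv_text = "GEOID,white_pct\n01001,71.3\n06037,25.6\n48215,5.9\n"

# Option 1: read the column as a string so leading zeros are never dropped.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"GEOID": str})

# Option 2: repair a column that was already parsed as integers.
as_int = pd.read_csv(io.StringIO(csv_text))
repaired = as_int["GEOID"].astype(str).str.zfill(5)

print(as_text["GEOID"].tolist())
print(repaired.tolist())
```

Both options should give five-character county codes. Note that `zfill(5)` pads any code shorter than five characters, whereas the loops in this notebook only pad codes of length four.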

In [None]:
percents = pd.read_csv('all_race_pct_by_county.csv')

In [None]:
percents = percents.astype({'GEOID': 'str'})

In [None]:
percents.iloc[0]['GEOID'] 

In [None]:
string_fips = ['0']*len(percents)
for i in range(len(percents)):
    if len(percents.iloc[i]['GEOID']) == 4: 
        string_fips[i] = '0' + percents.iloc[i]['GEOID']
    else: 
        string_fips[i] = percents.iloc[i]['GEOID']
percents['GEOID'] = string_fips

In [None]:
percents['GEOID']

In [None]:
data_w_o = pd.read_csv('dataframe_stats_and_shape_with_ct_counties_711.csv')
data_w_o

In [None]:
test = data_w_o['white'] + data_w_o['black'] + data_w_o['amin'] + data_w_o['asian'] + data_w_o['nhpi'] + data_w_o['two_or_more'] + data_w_o['other']
data_w_o['total_pop'] = test

In [None]:
data_w_o = data_w_o.astype({'GEOID': 'str'})
string_fips = ['0']*len(data_w_o)
for i in range(len(data_w_o)):
    if len(data_w_o.iloc[i]['GEOID']) == 4: 
        string_fips[i] = '0' + data_w_o.iloc[i]['GEOID']
    else: 
        string_fips[i] = data_w_o.iloc[i]['GEOID']
data_w_o['GEOID'] = string_fips

In [None]:
data_w_o.iloc[0]['GEOID']

In [None]:
data_w_o.columns.to_list()

# Functions

In [None]:
def fips_to_string(dataframe, fips_col_name):
    # Convert the FIPS column to strings and restore a dropped leading zero.
    dataframe = dataframe.astype({fips_col_name: 'str'})
    string_fips = ['0']*len(dataframe)
    for i in range(len(dataframe)):
        if len(dataframe.iloc[i][fips_col_name]) == 4: 
            # Four digits means the leading 0 was lost when read as an integer.
            string_fips[i] = '0' + dataframe.iloc[i][fips_col_name]
        else: 
            string_fips[i] = dataframe.iloc[i][fips_col_name]
    dataframe[fips_col_name] = string_fips
    return dataframe

In [None]:
def make_heatmap(dataframe, fips_column_name, column_name):
    maximum = dataframe[column_name].max()
    fig = px.choropleth(dataframe, geojson=counties, locations=fips_column_name, color=column_name,
                           color_continuous_scale="Viridis",
                           range_color=(0, maximum),
                           scope="usa",
                           labels={'white':'percent white pop'}
                          )
    fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
    fig.show()

## Visualizations

In [None]:
counts = pd.read_csv('dataframe_stats_and_shape_w_ct_counties.csv')

In [None]:
type(counts.iloc[0]['GEOID'])

In [None]:
percents.columns.to_list()

In [None]:
categories = [
    "white_pct",
    "black_pct",
    "amin_pct",
    "asian_pct",
    "nhpi_pct",
    "other_pct",
    "two_or_more_pct",
    "nh_white_pct",
    "nh_black_pct",
    "nh_amin_pct",
    "nh_asian_pct",
    "nh_nhpi_pct",
    "nh_other_pct",
    "nh_two_or_more_pct",
    'h_white_pct',
    'h_black_pct',
    'h_amin_pct',
    'h_asian_pct',
    'h_nhpi_pct',
    'h_other_pct',
    'h_two_or_more_pct'
]

In [None]:
for category in categories:
    make_heatmap(percents, 'GEOID', category)

In [None]:

maximum = percents['h_black_pct'].max()
fig = px.choropleth(percents, geojson=counties, locations='GEOID', color='h_black_pct',
                       color_continuous_scale="Viridis",
                       range_color=(0, 2),
                       scope="usa",
                       labels={'white':'percent white pop'}
                      )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

## Some notes

Let W stand for white alone, Hispanic or not. Consider the following three categories:
1. Hisp + W (alone, not checking anything else)
2. Hisp + some other race (alone, not checking anything else)
3. Hisp + x for x being one of the other races we have information on.

We expect 1 to be about 50%, 2 to be about 50%, and 3 to be about epsilon. 
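The three shares can be computed from county-level counts roughly as follows. This is a minimal sketch on made-up numbers: the column names mirror the `h_*` count columns used later in this notebook, and `three_way_split` is a hypothetical helper, not something defined elsewhere. It also reports the two-or-more share, so that the returned values sum to 1.

```python
import pandas as pd


def three_way_split(df):
    """Return national shares of Hispanic respondents by race group."""
    race_cols = ["h_white", "h_black", "h_amin", "h_asian",
                 "h_nhpi", "h_other", "h_two_or_more"]
    totals = df[race_cols].sum()       # national count per category
    hispanic_total = totals.sum()      # all Hispanic respondents
    x_cols = ["h_black", "h_amin", "h_asian", "h_nhpi"]
    return {
        "hisp_white_alone": totals["h_white"] / hispanic_total,
        "hisp_sor_alone": totals["h_other"] / hispanic_total,
        "hisp_x_alone": totals[x_cols].sum() / hispanic_total,
        "hisp_two_or_more": totals["h_two_or_more"] / hispanic_total,
    }


# Two made-up counties, for illustration only.
toy = pd.DataFrame({
    "h_white": [40, 10], "h_black": [2, 1], "h_amin": [1, 1],
    "h_asian": [1, 0], "h_nhpi": [0, 0], "h_other": [30, 10],
    "h_two_or_more": [3, 1],
})
shares = three_way_split(toy)
print(shares)
```

Pooling the counts before dividing gives the national share directly, rather than an average of county-level shares.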

With this in mind, we have some questions: 
1. Is this 1/2-1/2-epsilon split roughly correct nationally? Of SOR alone, what share is Hispanic? How likely is someone to be in two or more, versus in two or more given that they picked some other race?
2. What states and counties are farthest from the national trend?
3. What about the share of NH SOR; what states and counties are much higher?
4. Do the 3 categories above plus non-hispanic

(EDIT: view all except X as viewing the races alone)

How does this change if you shift from total population to voting age population?

ACS categories: https://data.census.gov/table/ACSDT1Y2022.B02001?q=asian

https://data.census.gov/table?q=s2901

Starting to look at microdata: https://data.census.gov/table/ACSDT5Y2022.B05006?q=B05006

# Decennial Questions

## The 50-50-epsilon split

In [None]:
h_categories = [
    'h_white',
    'h_black',
    'h_amin',
    'h_asian',
    'h_nhpi',
    'h_other',
    'h_two_or_more'
]

In [None]:
h_sums = []
for h_cat in h_categories:
    sum_h_cat = percents[h_cat].sum()
    h_sums.append(sum_h_cat)

total_h = sum(h_sums)

In [None]:
h_sums

In [None]:
p_h_total = percents['h_total'].sum()

In [None]:
total_h

In [None]:
p_h_total

In [None]:
h_sums/total_h

In [None]:
h_sums[5]/total_h

In [None]:
hX = (h_sums/total_h)[1] + (h_sums/total_h)[2] + (h_sums/total_h)[3] + (h_sums/total_h)[4] 

In [None]:
hX

Note that as defined (see Week 2 of Moon's class notebooks for details), this is *not* a 50-50-epsilon split: Hispanic and white alone is only about 20.1%. Hispanic and just some other race is about 41.4%, and 33.3% of people who marked Hispanic marked at least two races (including both white and some other race). The remaining ~5% are Hispanic and just one of the other selections.

Therefore, if we are looking at white alone, the 50-50-epsilon split is incorrect. However, if we are willing to put Hispanic white and Hispanic two-or-more in one group, it does hold roughly (with epsilon being about 5%).
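As a quick arithmetic check of the regrouping, the percentages quoted above can be combined directly. This is only a sketch using the rounded figures from the text, not a recomputation from the data.

```python
# Shares of Hispanic respondents quoted in the text above, in percent.
h_white_alone = 20.1
h_sor_alone = 41.4
h_two_or_more = 33.3

# Whatever is left is Hispanic plus exactly one of the other races.
h_x_alone = round(100 - h_white_alone - h_sor_alone - h_two_or_more, 1)

# Grouping white alone with two-or-more gives the first "half".
white_or_multi = round(h_white_alone + h_two_or_more, 1)

print(white_or_multi, h_sor_alone, h_x_alone)
```

With these rounded inputs the grouped share is 53.4% against 41.4% for some other race alone, with about 5.2% left over.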

### White and white & SOR 

If we want to answer the 50-50-epsilon question with the categories:
1. Hispanic and white, or Hispanic and white and SOR;
2. Hispanic and SOR_0;
3. Hispanic and X (X standing for any other category),

we will look at the `data_w_o` (with white and other) dataframe.

In [None]:
data_w_o.columns.to_list()

In [None]:
hispanic_pop = data_w_o['h_white'].sum()+data_w_o['h_black'].sum() + data_w_o['h_amin'].sum() + data_w_o['h_asian'].sum() + data_w_o['h_nhpi'].sum() + data_w_o['h_other'].sum() + data_w_o['h_two_or_more'].sum()


In [None]:
hispanic_pop

In [None]:
data_w_o['h_white_and_other'].sum() + data_w_o['h_white'].sum()

In [None]:
(data_w_o['h_white_and_other'].sum() + data_w_o['h_white'].sum())/hispanic_pop

In [None]:
data_w_o['h_other'].sum()/hispanic_pop

In [None]:
two_or_more_minus = data_w_o['h_two_or_more'].sum() - data_w_o['h_white_and_other'].sum() 

In [None]:
(data_w_o['h_black'].sum() + data_w_o['h_amin'].sum() + data_w_o['h_asian'].sum() + data_w_o['h_nhpi'].sum())/hispanic_pop


In [None]:
two_or_more_minus/hispanic_pop

Thus, we conclude that once we add Hispanic white alone and Hispanic white-and-SOR together, we do get roughly the 50-50-epsilon split; in particular:
* 48.1% Hispanic and white only, or Hispanic and (white and SOR) only;
* 41.4% Hispanic and SOR only;
* 5.5% Hispanic and two or more that are not just white and SOR;
* 4.9% Hispanic and everything else.

### Which states/counties deviate the most from this average?

In [None]:
percents.columns.to_list()

In [None]:
make_heatmap(percents, 'GEOID', 'h_white_pct')

In [None]:
make_heatmap(percents, 'GEOID', 'h_other_pct')

In [None]:
deviations= pd.DataFrame()

deviations['h_white_pct_devs'] = percents['h_white_pct'] - 20.1
deviations['GEOID'] = percents['GEOID']

In [None]:
deviations['h_white_pct_devs'].max()

In [None]:
minimum = deviations['h_white_pct_devs'].min()
maximum = deviations['h_white_pct_devs'].max()
fig = px.choropleth(deviations, geojson=counties, locations='GEOID', color='h_white_pct_devs',
                       color_continuous_scale="agsunset",
                       range_color=(minimum, maximum),
                       scope="usa",
                       #labels={'white':'percent white pop'}
                      )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

We conclude that Texas and New Mexico are the states where the Hispanic white-alone percentage exceeds the baseline the most (deviation in the positive sense). The areas with the largest deviation in the negative sense are:

In [None]:
deviations['h_white_pct_devs_neg'] = deviations['h_white_pct_devs']*(-1)

In [None]:

fig = px.choropleth(deviations, geojson=counties, locations='GEOID', color='h_white_pct_devs_neg',
                       color_continuous_scale="Viridis",
                       range_color=(maximum*(-1), minimum*(-1)),
                       scope="usa",
                       #labels={'white':'percent white pop'}
                      )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

To me, this seems to say that the values are polarized. 
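The baseline of 20.1 above is typed in by hand. A possible alternative, sketched here on made-up counts, is to compute the national share from the data and then rank counties by their deviation from it. This assumes the quantity of interest is the share among Hispanic residents; the column names `h_white` and `h_total` follow the count columns used elsewhere in this notebook, while the GEOIDs and numbers are placeholders.

```python
import pandas as pd

# Placeholder counties with made-up counts.
toy = pd.DataFrame({
    "GEOID": ["00001", "00002", "00003"],
    "h_white": [80, 5, 60],
    "h_total": [100, 50, 100],
})

# National share from pooled counts, so large counties weigh more.
national_share = 100 * toy["h_white"].sum() / toy["h_total"].sum()

# County share of Hispanic residents who marked white alone, minus baseline.
toy["h_white_share"] = 100 * toy["h_white"] / toy["h_total"]
toy["h_white_dev"] = toy["h_white_share"] - national_share

most_above = toy.nlargest(1, "h_white_dev")["GEOID"].iloc[0]
most_below = toy.nsmallest(1, "h_white_dev")["GEOID"].iloc[0]
print(national_share, most_above, most_below)
```

Using `nlargest` and `nsmallest` on the same deviation column gives both tails without building a separate negated column.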

The specific names:

In [None]:
largest_row_1 = deviations.nlargest(20, ['h_white_pct_devs'])

In [None]:
largest_row_1

So, the top 20 counties are:
1. Reeves County, TX
2. Zavala County, TX
3. Duval County, TX
4. Jim Hogg County, TX
5. Jim Wells County, TX
6. Garza County, TX
7. Webb County, TX
8. Willacy County, TX
9. Dimmit County, TX
10. Kleberg County, TX
11. Zapata County, TX
12. Starr County, TX
13. Brooks County, TX
14. Mora County, NM
15. Cameron County, TX
16. Frio County, TX
17. Guadalupe County, NM
18. Hidalgo County, TX
19. Val Verde County, TX
20. Maverick County, TX

Now let us look at people who are Hispanic and chose only some other race. Where are the deviations the highest?

In [None]:
deviations['h_other_pct_devs'] = 100*(percents['h_other']/percents['h_total']) - 100*(percents['h_other']/percents['h_total']).mean()





In [None]:
minimum = deviations['h_other_pct_devs'].min()
maximum = deviations['h_other_pct_devs'].max()
fig = px.choropleth(deviations, geojson=counties, locations='GEOID', color='h_other_pct_devs',
                       color_continuous_scale="agsunset",
                       range_color=(minimum, maximum),
                       scope="usa",
                       #labels={'white':'percent white pop'}
                      )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

In [None]:
largest_row_2 = deviations.nlargest(20, ['h_other_pct_devs'])

So, the top 20 counties are:
1. Stewart County, GA
2. Aleutians East Borough, AK
3. Iberia Parish, LA
4. Franklin County, PA
5. St Mary Parish, LA
6. Caldwell Parish, LA
7. Bond County, IL
8. Effingham County, IL
9. Boone County, AR
10. Washington County, IL
11. Adams County, MO
12. Simpson County, MO
13. Houston County, GA
14. Franklin County, MO
15. Howard County, NE
16. Crawford County, GA
17. Mitchell County, NC
18. Greene County, NC
19. Jefferson County, MO
20. Wake County, NC

## Of only SOR, what percent hispanic?

In [None]:
some_other_race = pd.DataFrame()

In [None]:
some_other_race['GEOID'] = percents['GEOID']
some_other_race['some_other_race'] = percents['other']
some_other_race['nh_some_other_race'] = percents['nh_other']
some_other_race['h_some_other_race'] = percents['h_other']

In [None]:
some_other_race

In [None]:
some_other_race['h_share'] = some_other_race['h_some_other_race']/some_other_race['some_other_race']

In [None]:
some_other_race

In [None]:
make_heatmap(some_other_race, 'GEOID', 'h_share')

In [None]:
total_some_other_race = some_other_race['some_other_race'].sum()

In [None]:
total_sor_h = some_other_race['h_some_other_race'].sum()
total_sor_nh = some_other_race['nh_some_other_race'].sum()

In [None]:
total_sor_h/total_some_other_race

In [None]:
total_sor_nh/total_some_other_race

So, of those who check off ONLY some other race, 94% are Hispanic.
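Note that this 94% figure is a pooled national share (sum of counts over sum of counts), which is not the same as averaging the county-level `h_share` values. A small sketch on made-up counts shows the difference, and also how a county with no some-other-race respondents ends up with a missing share.

```python
import pandas as pd

# Placeholder counties with made-up counts.
toy = pd.DataFrame({
    "GEOID": ["00001", "00002", "00003"],
    "some_other_race": [1000, 10, 0],
    "h_some_other_race": [950, 2, 0],
})

# County-level share; 0/0 yields NaN rather than raising an error.
toy["h_share"] = toy["h_some_other_race"] / toy["some_other_race"]

# Pooled share weights each county by its number of respondents.
pooled = toy["h_some_other_race"].sum() / toy["some_other_race"].sum()

# Unweighted mean of county shares; mean() ignores NaN rows by default.
unweighted = toy["h_share"].mean()

print(round(pooled, 3), round(unweighted, 3))
```

In this toy example the small county pulls the unweighted mean far below the pooled share, which is why rankings of individual counties by `h_share` are best read alongside their population counts.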

What states/counties are high in non-Hispanic some other race?

In [None]:
smallest_rows_1 = some_other_race.nsmallest(20, ['h_share'])

In [None]:
smallest_rows_1['GEOID'].to_list()

The counties with the highest non-Hispanic share of some other race alone (that is, the smallest `h_share`) are:
1. Cache County UT
2. Grant County NE
3. McPherson County NE
4. Calhoun County WV
5. Clay County WV
6. Hinsdale County CO
7. Carter County MT
8. Garfield County NE
9. Powder River County MT
10. Wheeler County NE
11. Houston County GA
12. Kalawao County HI
13. Liberty County MT
14. Hidalgo County TX
15. Muskegon County MI
16. Jewell County KS
17. Haakon County SD
18. Bell County KY
19. Potter County SD
20. Sully County SD

## Two or more

In [None]:
percents['two_or_more']

In [None]:
make_heatmap(percents, 'GEOID', 'two_or_more_pct')

In [None]:
percents['h_two_or_more'].sum()/percents['two_or_more'].sum()

In [None]:
percents['nh_two_or_more'].sum()/percents['two_or_more'].sum()

So, of the people who checked off two or more races, 62% checked off Hispanic and 38% did not. The split, therefore, is not roughly correct.

What if we wanted to look at choosing 2 or more and some other race?

## What if we changed to voting age population?