# County-to-county Inflow Migration US Map

### Documentation for network analysis on inflow migrations

### Sections
1. Import modules needed
2. Data preprocessing
3. Preprocess data for US map
4. Create US Map

### Main Goals

* Redo Ohio county inflow graph in Python
* Create reproducible code that makes inflow community graphs for every state
* Later on: look at longer-distance moves, such as moves into counties from other states


# Section 1: Import Modules Needed

In [110]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
import pylab
from cdlib import algorithms,viz
from matplotlib import colormaps
from urllib.request import urlopen
import json
import plotly.express as px
import leidenalg as la
import igraph as ig

# Section 2: Data Preprocessing

This section defines `CleanData`, the function that:

* Imports the county inflow CSV data from the IRS website
* Renames columns to standard names so that `inflow` files from other years can be used
* Filters out total-migration, non-migrant, and foreign-migrant rows
* Filters out rows that have 40 or fewer returns (`ReturnNum`)
* Adds target county names and target states through an inner join
* Creates the node and edge lists of counties and states that can be used in our network graphs


## Standardize the FIPS columns of our TotalEdgeList to match the FIPS codes used by the US map
* `TargetCountyFips` is padded to three digits: two-digit codes get one leading zero, one-digit codes get two leading zeros
* `TargetStateFips` is padded to two digits: one-digit codes get one leading zero
* The two values are concatenated into one column, `TargetTotalFips`
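
The padding rule above can be illustrated with a small standalone sketch. The helper name `make_total_fips` is hypothetical and not part of the notebook; it only mirrors, on single values, what `.str.zfill` does on the two columns.

```python
def make_total_fips(state_fips, county_fips):
    """Build a 5-character FIPS code from numeric state and county codes.

    Hypothetical helper for illustration only; the notebook applies the
    same padding column-wise with pandas `.str.zfill`.
    """
    state = str(state_fips).zfill(2)    # 1 -> "01", 39 -> "39"
    county = str(county_fips).zfill(3)  # 1 -> "001", 49 -> "049"
    return state + county


# A few (state, county) pairs with different numbers of digits
examples = [(39, 49), (1, 1), (6, 37)]
for state_fips, county_fips in examples:
    print(state_fips, county_fips, "->", make_total_fips(state_fips, county_fips))
```

Keeping the codes as strings matters here, since converting them back to integers would drop the leading zeros again.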

In [113]:
# Pass in the url for our county inflow csv from IRS site
def CleanData(url):
    raw_df = pd.read_csv(url, encoding='latin-1')
    df = pd.DataFrame(raw_df)

    # Change column names based on IRS info
    df.rename(columns={'y2_statefips': 'TargetStateFips', 'y2_countyfips': 'TargetCountyFips', 'y1_statefips': 'OriginStateFips',
                       'y1_countyfips': 'OriginCountyFips', 'y1_state': 'OriginState', 'y1_countyname': 'OriginCountyName',
                       'n1': 'ReturnNum', 'n2': 'IndividualsNum'}, inplace=True)

    # Filter out all origin state fips greater than 56
    df = df[df['OriginStateFips'] <= 56]
    # Remove rows that have 40 or fewer returns
    df = df[df['ReturnNum'] > 40]
    # Filter out origin county names that contain the strings 'Non-migrants' or 'Foreign'
    df = df[~df['OriginCountyName'].str.contains('Non-migrants|Foreign')]

    # Create target countyname and target state name. We need these target county names to add into our edgelist as the 'Target'
    TargetTable = df[['OriginCountyName', 'OriginState', 'OriginStateFips', 'OriginCountyFips']].copy()
    TargetTable.drop_duplicates(inplace=True)

    # Rename columns for the target information
    TargetTable.rename(columns={'OriginCountyName': 'TargetCountyName', 'OriginState': 'TargetState', 
                                'OriginStateFips': 'TargetStateFips', 'OriginCountyFips': 'TargetCountyFips'}, inplace=True)

    # Merge with the main dataframe
    Merge = pd.merge(
        df, 
        TargetTable[['TargetStateFips', 'TargetCountyFips', 'TargetState', 'TargetCountyName']], 
        left_on=['TargetStateFips', 'TargetCountyFips'], 
        right_on=['TargetStateFips', 'TargetCountyFips'],
        how='inner' 
    )

    # Remove 'County' or 'county' from OriginCountyName and TargetCountyName
    Merge['OriginCountyName'] = Merge['OriginCountyName'].str.replace('County', '', case=False)
    Merge['TargetCountyName'] = Merge['TargetCountyName'].str.replace('County', '', case=False)

    # Create a node list and an edge list of counties within the states.
    # .copy() makes each table independent of Merge, which avoids pandas'
    # SettingWithCopyWarning on the in-place operations and column assignments below
    TotalNodeList = Merge[['TargetCountyFips', 'TargetCountyName', 'TargetState']].copy()
    TotalNodeList = TotalNodeList.sort_values(by='TargetCountyFips')
    TotalNodeList.rename(columns={'TargetCountyFips': 'CountyFips', 'TargetCountyName': 'CountyName', 'TargetState': 'State'}, inplace=True)
    TotalNodeList.drop_duplicates(inplace=True)


    TotalEdgeList = Merge[['OriginState', 'OriginCountyName', 'TargetState', 'TargetCountyName', 'ReturnNum', 'agi', 'OriginCountyFips',
                           'TargetCountyFips', 'TargetStateFips', 'OriginStateFips']].copy()
    TotalEdgeList.drop_duplicates(inplace=True)
    TotalEdgeList.sort_values(by=['OriginState', 'OriginCountyName'], inplace=True)

    # Adjusting FIPS columns in TotalEdgeList
    TotalEdgeList['TargetCountyFips'] = TotalEdgeList['TargetCountyFips'].astype(str).str.zfill(3)  # Ensure 3 digits
    TotalEdgeList['TargetStateFips'] = TotalEdgeList['TargetStateFips'].astype(str).str.zfill(2)    # Ensure 2 digits

    # Create TargetTotalFips by concatenating TargetStateFips and TargetCountyFips
    TotalEdgeList['TargetTotalFips'] = TotalEdgeList['TargetStateFips'] + TotalEdgeList['TargetCountyFips']

    # Confirm that the cleaned tables are ready
    print('You can now access "TotalNodeList" and "TotalEdgeList" tables for further use!')
    return TotalNodeList, TotalEdgeList, TargetTable
    
# Running this function will return the cleaned Node and Edge list required for our network graphs! 
# We can pass in any county-inflow csv link through the IRS site
TotalNodeList, TotalEdgeList, TargetTable = CleanData(url='https://www.irs.gov/pub/irs-soi/countyinflow2122.csv')

You can now access "TotalNodeList" and "TotalEdgeList" tables for further use!
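
Since `CleanData` accepts any county-inflow CSV link, the URL for another year could be built programmatically. The sketch below is only a guess at the pattern: it assumes other years follow the same `countyinflowYYZZ.csv` naming as the 2021-2022 file used above, which should be checked against the IRS site. The helper name `inflow_url` is hypothetical.

```python
def inflow_url(first_year):
    """Return a county-inflow CSV URL for the period starting in `first_year`.

    Assumption: every year uses the same 'countyinflowYYZZ.csv' naming as
    the 2021-2022 file; verify on the IRS site before relying on this.
    """
    yy = first_year % 100        # 2021 -> 21
    zz = (first_year + 1) % 100  # 2021 -> 22
    return f"https://www.irs.gov/pub/irs-soi/countyinflow{yy:02d}{zz:02d}.csv"


print(inflow_url(2021))
# → https://www.irs.gov/pub/irs-soi/countyinflow2122.csv
```

The result could then be passed in as `CleanData(url=inflow_url(2021))`.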


# Section 3: Preprocess data for US map

In this section:
* Recreate the state graph as one graph covering the whole United States
* Create a separate community dataframe that lists each county name with the community it belongs to
* Join the community dataframe onto the TotalEdgeList to identify which county belongs to each community
* Pass the resulting edge dataframe, which now has community labels, to the map as `CountyDf`
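
The node-to-community mapping described above can be sketched without any graph library. The community lists below are made-up toy data, and the `"Name, ST"` node labels are an assumption for illustration: the notebook's graph uses bare county names as nodes, so counties that share a name across states end up as a single node. The helper `node_label` is hypothetical.

```python
# Toy communities, shaped like the list of lists in `communities.communities`
toy_communities = [
    ["Franklin, OH", "Delaware, OH"],
    ["Franklin, PA", "Adams, PA"],
]

# Same comprehension pattern as the notebook: node -> community id
community_of = {
    node: cid
    for cid, community in enumerate(toy_communities)
    for node in community
}


def node_label(county_name, state):
    """Hypothetical helper: build a state-qualified node label."""
    return f"{county_name.strip()}, {state}"


# The two Franklin counties stay distinct because the state is in the label
print(community_of[node_label("Franklin ", "OH")])
print(community_of[node_label("Franklin", "PA")])
```

With bare names, both Franklin counties would map to one dictionary key, so only one community id could be stored for them.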

In [115]:
def CleanUSData(TotalEdgeList):
    # Generate the graph based on the total edge list
    G = nx.from_pandas_edgelist(TotalEdgeList, source='OriginCountyName', target='TargetCountyName', edge_attr='ReturnNum')
    
    # Use the CDLib library to run the Leiden algorithm for community detection
    communities = algorithms.leiden(G)
    
    # Create a dictionary mapping each node (county) to its community
    CommunityDict = {node: cid for cid, community in enumerate(communities.communities) for node in community}
    
    # Convert the community dictionary into a DataFrame
    CommunityDf = pd.DataFrame(list(CommunityDict.items()), columns=['CountyName', 'Community'])
    
    # Sort the dataframe by community for easier interpretation
    CommunityDf.sort_values('Community', inplace=True)
    
    # Reset the index for a clean presentation
    CommunityDf.reset_index(drop=True, inplace=True)
    
    # Join the community dataframe onto the TotalEdgeList to identify which county belongs to each community
    CountyDf = pd.merge(
                    TotalEdgeList,
                    CommunityDf,
                    how='inner',
                    left_on='TargetCountyName',
                    right_on='CountyName',
                )
    
    print('You can now access CountyDf to create the US Map')
    return G, CountyDf, CommunityDf
    
# Output
G, CountyDf, CommunityDf = CleanUSData(TotalEdgeList)

You can now access CountyDf to create the US Map


# Section 4: Create US Map 

The goal of this section is to:
* Create a choropleth map showing communities across the US
* Color the counties discretely, so different colors represent different communities
* Use Plotly to make the map interactive, so you can filter out certain communities

Later on, I will add labels to identify every county. For now, I have left out the edges between counties because drawing them causes a memory issue.
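
Because `CountyDf` has one row per edge, each target county appears in it many times. One possible way to shrink the map input is to keep a single community label per FIPS code, cast to a string so it reads as a discrete category. The sketch below uses plain dictionaries with made-up rows; `one_label_per_county` is a hypothetical helper, not part of the notebook.

```python
def one_label_per_county(edge_rows):
    """Collapse edge-level rows to one community label per county FIPS code.

    Hypothetical helper; `edge_rows` stands in for the rows of CountyDf.
    """
    labels = {}
    for row in edge_rows:
        # Keep the first label seen for each county, cast to str so the
        # value is treated as a category rather than a number
        labels.setdefault(row["TargetTotalFips"], str(row["Community"]))
    return labels


# Made-up rows: two edges into county 39049 and one into county 39041
toy_rows = [
    {"TargetTotalFips": "39049", "Community": 0},
    {"TargetTotalFips": "39049", "Community": 0},
    {"TargetTotalFips": "39041", "Community": 0},
]
print(one_label_per_county(toy_rows))
```

With pandas, `DataFrame.drop_duplicates(subset=...)` on the FIPS column would give a comparable one-row-per-county table.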

In [121]:
# Download a json file with counties and fips
#with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
#    counties = json.load(response)


# Ensure the Community column is treated as categorical
#CountyDf['Community'] = CountyDf['Community'].astype(str)

# Create the choropleth map with discrete colors
#fig = px.choropleth(
#    CountyDf,
#    geojson=counties,
#    locations='TargetTotalFips',  # Column with FIPS codes
#    color='Community',           # Column to assign discrete colors
#    color_discrete_sequence=px.colors.qualitative.Vivid,  # Discrete color scale
#    scope='usa',
#    labels={'Community': 'Community'},
#    title='Leiden Communities by US Counties',
#)

# Update layout for better appearance
#fig.update_layout(margin=dict(l=60, r=60, t=50, b=50))

# Show the map
#fig.show()
#fig.write_html("us_county_map.html")