<a href="https://colab.research.google.com/github/moonlight1431/food_desert/blob/main/Food_Desert_Prediction_Notebook_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


##**Notebook #4: INTRO TO EXPLORATORY DATA ANALYSIS (EDA) & DATA CLEANING**##
---
Please make a copy of this notebook, add it to the Notebook #4 folder under the Completed Notebooks folder on the Google Drive, and rename the notebook to "FirstName LastName - Food Desert Prediction - Notebook #4".



##Icebreaker##


*   Spring break plans 😊



# **Midpoint Deliverable**

**Subteams** (either 3 pairs or 2 trios):
- Team #1: Emmy, Rin, Julina
- Team #2: Tiffany, Sidney, Abby
---
The midpoint deliverable will take place on **Tuesday, April 1st**. Attendance is **mandatory** for all project members. The midpoint stage requires a mini presentation / screen share delivered to fellow research teams. Each project member should speak at least once during the presentation. Here's a breakdown of what should be covered:

- **[FOR PMs] Project Overview:** Introduce the project idea, purpose, and mention the notebooks we've done so far.
- **Subteam Demonstrations:** Each subteam should showcase:
  - Current data cleaning processes and results.
  - Preliminary data analysis findings.
  - At least ONE data visualization.
  - Although all subteams use the same dataset, each subteam demonstration must be unique (aka don't all make the exact same visualizations). The dataset is very big, with lots of variables to choose from.
- **Challenges:** Discuss any obstacles faced and how they were addressed or will be tackled.
  - Team #1 will be responsible for one challenge
  - Team #2 will be responsible for one challenge
- **Technical Overview:** Detail the tech stack being utilized, including specific languages and libraries.
  - Team #1 will be responsible for this part
- **Key Data Science Questions:** Outline the primary data science questions the project aims to answer.
  - Team #1 will be responsible for one question
  - Team #2 will be responsible for one question
- **Looking Ahead:** Share the project's future objectives and anticipated milestones.
  - Team #2 will be responsible for this part

This structure gives all research teams a comprehensive update and fosters collaboration and shared learning. Your presentation should be no more than **10 minutes** long, followed by **2 minutes of Q&A** from the audience.

Each group should try to focus on making one data visualization!

Here is an example from one of the research project teams from last semester: [[FVA] - Midpoint Deliverable](https://docs.google.com/presentation/d/1z2ciGDNI69RUSXlS6_AQepnVL9-51kAfSwYlNrIOYhk/edit?usp=sharing)

Our project spec may also be helpful to you: [Project Spec](https://docs.google.com/document/d/1CuWeaaYyZCRypNdd24m0LsQq9Z5EnEyV_0D95KbDNa8/edit?usp=sharing)

All subteams must finish their parts by **Saturday, March 29th** so we can look over them and let you guys know if any changes are necessary! Let Tanushri and Soumily know if you have any questions along the way.

# **Exploratory Data Analysis (EDA)**

This is the link to the dataset we will be working with: [Food Deserts in the U.S.](https://www.kaggle.com/datasets/tcrammond/food-access-and-food-deserts/data)

Two datasets should be in the Files folder on the left:
- food_access_research_atlas.csv
  - The primary file for our EDA, with all our food desert data
- food_access_variable_lookup.csv
  - File containing the full names of the column variables, for your reference. You can also use this link if that's easier: [food_access_variable_lookup on Kaggle](https://www.kaggle.com/datasets/tcrammond/food-access-and-food-deserts?resource=download&select=food_access_variable_lookup.csv)

We have used df1 to denote the food_access_research_atlas.csv dataframe.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

from mpl_toolkits.mplot3d import Axes3D # 3D plotting
import matplotlib.pyplot as plt # plotting
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # data visualizations

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
# If /kaggle/input does not exist (e.g. outside Kaggle), this loop simply prints nothing

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
nRowsRead = 1000 # set to None to read the whole file
# food_access_research_atlas.csv may have more rows in reality, but we are only loading/previewing the first 1000 rows
file_path = "/content/food_access_research_atlas.csv"
df1 = pd.read_csv(file_path, delimiter=',', nrows = nRowsRead)
df1.dataframeName = 'food_access_research_atlas.csv'
nRow, nCol = df1.shape
print(f'There are {nRow} rows and {nCol} columns')

**Let's take a quick look at what the data looks like:**

In [None]:
df1.head()  # Show first few rows

In [None]:
df1.info()  # Check column types & missing values

In [None]:
df1.describe()  # Summary statistics
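
`df1.info()` lists non-null counts, but a per-column tally of missing values is often easier to read. The sketch below shows one way to build that tally. It runs on a tiny made-up DataFrame called `demo` (an assumption, standing in for `df1`), and the helper name `summarize_missing` is hypothetical; call it on `df1` to inspect the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df1; the values are made up for illustration.
demo = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "PovertyRate": [12.5, np.nan, 8.1, 20.3],
    "MedianFamilyIncome": [52000.0, 41000.0, np.nan, np.nan],
})

def summarize_missing(df):
    """Return the count and percentage of missing values per column, largest first."""
    missing = df.isna().sum()
    summary = pd.DataFrame({
        "missing_count": missing,
        "missing_pct": (missing / len(df) * 100).round(1),
    })
    return summary.sort_values("missing_count", ascending=False)

missing_summary = summarize_missing(demo)
print(missing_summary)
```

Columns near the top of this table are the ones most likely to need cleaning before they can be plotted.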

Use the **Python libraries** we have discussed so far to explore the dataset and generate your own data visualizations. Then, in your slideshow, explain what some of these visualizations reveal about the data. Look for trends across the data, correlations between certain factors, and more.


---


Here are some examples of visualizations you can make that we have discussed in previous notebooks:
- **Scatter plot**
- **Bar graph**
- **Heat map**
- **Violin plot**



---


Some Notes:
- For some visualizations you may have to use **data cleaning** to make sure your data is ready to use (see the heatmap example below).
- Data cleaning involves removing or correcting data that is incorrect, irrelevant, or improperly formatted.
- If you are unsure how to create any of these, go back and reference the notebook examples!
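
To make the data cleaning note above concrete, here is a minimal sketch of a few common cleaning steps in pandas. The DataFrame `raw`, its deliberately messy values, and the helper name `clean` are all made up for illustration (assumptions, not taken from the real dataset); which steps you actually need depends on what you find in `df1`.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data: stray spaces, a text placeholder, an impossible value, a duplicate row.
raw = pd.DataFrame({
    "County": [" Autauga", "Autauga ", "Baldwin", "Baldwin", "Barbour"],
    "PovertyRate": ["12.5", "12.5", "n/a", "8.1", "250"],
})

def clean(df):
    """Return a cleaned copy of df; the original is left untouched."""
    out = df.copy()
    # Improperly formatted: strip stray whitespace from text values.
    out["County"] = out["County"].str.strip()
    # Improperly formatted: convert text to numbers; unparseable entries become NaN.
    out["PovertyRate"] = pd.to_numeric(out["PovertyRate"], errors="coerce")
    # Incorrect: a percentage outside 0-100 cannot be right, so mark it as missing.
    out.loc[~out["PovertyRate"].between(0, 100), "PovertyRate"] = np.nan
    # Irrelevant: drop exact duplicate rows, then rows still missing the value.
    out = out.drop_duplicates()
    out = out.dropna(subset=["PovertyRate"])
    return out.reset_index(drop=True)

cleaned = clean(raw)
print(cleaned)
```

Working on a copy means you can always compare the cleaned result against the original to see how many rows each step removed.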

Here is an example of a **heatmap** for this dataset to get you guys started:

Notice that:
- We had to remove the "State" and "County" columns because the correlation matrix should only include columns with numerical data, so that we can compute the correlation coefficients
- We dropped all columns with NaN (aka null values) to avoid messing with our heatmap

In [None]:
# Heatmap
def plotCorrelationMatrix(df, graphWidth):
    df = df.dropna(axis='columns')  # Drop columns with NaN (aka null values)
    df = df[[col for col in df if df[col].nunique() > 1]]  # Keep columns with more than 1 unique value
    df = df.select_dtypes(include=['number'])  # Remove non-numeric columns
    df = df.iloc[:, :10]  # Keep only the first 10 columns <-- can change to only show the heatmap for the columns you want
    if df.shape[1] < 2:
        print(f'No correlation plots shown: The number of non-NaN, non-constant columns ({df.shape[1]}) is less than 2')
        return
    corr = df.corr()
    fig = plt.figure(figsize=(graphWidth, graphWidth))  # graphWidth sets the figure size in inches
    # Pass fignum so matshow draws into the figure above instead of opening a new one
    corrMat = plt.matshow(corr, fignum=fig.number)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.gca().xaxis.tick_bottom()
    plt.colorbar(corrMat)
    plt.title('Correlation Matrix', fontsize=15)
    plt.show()

df_corr = df1.drop(columns=['State', 'County'])  # Drop the text columns (State and County)
plotCorrelationMatrix(df_corr, 8)

In [None]:
# Replace with code for your own visualizations!