**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update the relevant sections where you lost those points, and respond to your TA/IA on GitHub Issues to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Ant Man
- Hulk
- Iron Man
- Thor
- Wasp

# Research Question

-  Include a specific, clear data science question.
-  Make sure the variables you are measuring to answer the question are clear.

What is your research question? Include the specific question you're setting out to answer. This question should be specific, answerable with data, and clear. A general question with specific subquestions is permitted. (1-2 sentences)



## Background and Prior Work


- Include a general introduction to your topic
- Include explanation of what work has been done previously
- Include citations or links to previous work

This section will present the background and context of your topic and question in a few paragraphs. Include a general introduction to your topic and then describe what information you currently know about the topic after doing your initial research. Include references to other projects that have asked similar questions or approached similar problems. Explain what others have learned in their projects.

Find some relevant prior work, and reference those sources, summarizing what each did and what they learned. Even if you think you have a totally novel question, find the most similar prior work that you can and discuss how it relates to your project.

References can be research publications, but they need not be. Blogs, GitHub repositories, company websites, etc., are all viable references if they are relevant to your project. It must be clear which information comes from which references. (2-3 paragraphs, including at least 2 references)

 **Use inline citation through HTML footnotes to specify which references support which statements** 

For example: After government genocide in the 20th century, real birds were replaced with surveillance drones designed to look just like birds.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Use a minimum of 2 or 3 citations, but we prefer more.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) You need enough to fully explain and back up important facts. 

Note that if you click a footnote number in the paragraph above it will transport you to the proper entry in the footnotes list below.  And if you click the ^ in the footnote entry, it will return you to the place in the main text where the footnote is made.

To understand the HTML here, `<a name="..."></a>` is a tag that lets you create a named reference for a given location. Markdown has the construction `[text with hyperlink](#named reference)` that produces a clickable link that takes you to the named reference.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Lorenz, T. (9 Dec 2021) Birds Aren’t Real, or Are They? Inside a Gen Z Conspiracy Theory. *The New York Times*. https://www.nytimes.com/2021/12/09/technology/birds-arent-real-gen-z-misinformation.html 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Also refs should be important to the background, not some randomly chosen vaguely related stuff. Include a web link if possible in refs as above.


# Hypothesis



- Include your team's hypothesis
- Ensure that this hypothesis is clear to readers
- Explain why you think this will be the outcome (what was your thinking?)

What is your main hypothesis/predictions about what the answer to your question is? Briefly explain your thinking. (2-3 sentences)

# Data

## Data overview

- Dataset #1
  - Dataset Name: US Bureau of Labor Statistics Occupational Employment and Wage Estimates by Metropolitan and Nonmetropolitan Area (BLS OEWS for short)
  - Link to the dataset: [BLS OEWS](https://www.bls.gov/oes/tables.htm)
  - Cleaned dataset: [all_year_bls](dataset_work/bls_data/useful_csvs/all_years.csv)
  - Number of observations: 74580
  - Number of variables: 23
- Dataset #2 
  - Dataset Name: School Common Data Set Degrees Conferred (CDS for short)
  - Link to the dataset: [CDS](https://docs.google.com/spreadsheets/d/19kN52hyig4k--l1GXVEiIQVoSM10L_shYSVXepYsSDM/edit?usp=sharing) (a Google Sheet that lists the link to each school's CDS)
  - Cleaned dataset: [all_year_cds](dataset_work/cds_data/useful_csvs/all_years.csv)
  - Number of observations: 3092
  - Number of variables: 7

#### BLS OEWS:
This dataset covers 2006-2023 employment data by metropolitan/non-metropolitan area. Our cleaned dataset has 8 areas, each of which was chosen because of its proximity to one of the schools we are analyzing. The important variables in the dataset are year, location, occupation title (string), total employment (int), and average salary (int), although the other fields we left in the dataset also help with general analysis and classification. The employment count proportion and the salary may be proxies for the overall desirability of the job.
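As a minimal sketch of how the employment proportion could be derived from raw cells, the helpers below turn a cell into a number and divide an occupation's employment by an area total. This assumes that suppressed or top-coded values show up as markers such as `*`, `**`, or `#` and that counts may contain thousands separators; `parse_bls_number` and `employment_share` are hypothetical names and are not part of our cleaning pipeline.

```python
import math
from typing import Optional

# Assumption: these markers stand for suppressed or top-coded cells in the raw files.
SUPPRESSED_MARKERS = {"", "*", "**", "#"}


def parse_bls_number(cell: object) -> Optional[float]:
    """Convert a raw cell such as '1,230' to a float, or None if it is unusable."""
    if cell is None:
        return None
    text = str(cell).strip()
    if text in SUPPRESSED_MARKERS:
        return None
    try:
        value = float(text.replace(",", ""))
    except ValueError:
        return None
    # Missing values read back from a CSV often arrive as NaN.
    return None if math.isnan(value) else value


def employment_share(occupation_emp: object, area_total_emp: object) -> Optional[float]:
    """Share of an area's total employment held by one occupation."""
    emp = parse_bls_number(occupation_emp)
    total = parse_bls_number(area_total_emp)
    if emp is None or total is None or total == 0:
        return None
    return emp / total
```

Returning `None` instead of 0 keeps suppressed cells from being mistaken for occupations with no employment.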

To clean this dataset, we had to download all the `.xls` and `.xlsx` files from the BLS, convert them to `.csv`s, and then pull only the relevant areas from each year's dataset. Then, to combine the datasets and standardize the columns, some columns were dropped or renamed, leading to the overall dataset. 
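One way to picture the standardization step is a single lookup table from older area names to the name used in later years. The sketch below uses the same name pairs as the cleaning code further down; `AREA_ALIASES` and `canonical_area` are hypothetical names introduced only for illustration.

```python
# Name used for the combined New Hampshire area in later years.
NH_COMBINED = "West Central-Southwest New Hampshire nonmetropolitan area"

# Older area names mapped to the name they are merged into.
AREA_ALIASES = {
    "Los Angeles-Long Beach-Santa Ana, CA": "Los Angeles-Long Beach-Anaheim, CA",
    "San Diego-Carlsbad-San Marcos, CA": "San Diego-Carlsbad, CA",
    "Spokane, WA": "Spokane-Spokane Valley, WA",
    "Western New Hampshire nonmetropolitan area": NH_COMBINED,
    "Southwestern New Hampshire nonmetropolitan area": NH_COMBINED,
    "West Central New Hampshire nonmetropolitan area": NH_COMBINED,
    "Southwest New Hampshire nonmetropolitan area": NH_COMBINED,
}


def canonical_area(name: str) -> str:
    """Return the standardized area name, leaving unknown names unchanged."""
    return AREA_ALIASES.get(name, name)


print(canonical_area("Spokane, WA"))
print(canonical_area("Ann Arbor, MI"))
```

With pandas, the same dictionary could be passed to `Series.replace` to rename a whole column in one call.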

#### CDS:
This dataset encompasses 8 US universities, which were chosen for reasons listed in our [Google Sheet](https://docs.google.com/spreadsheets/d/19kN52hyig4k--l1GXVEiIQVoSM10L_shYSVXepYsSDM/edit?usp=sharing). The most limiting university was Colgate University, whose first CDS is from 2014, so the data spans degrees conferred from the 2014 through the 2023 graduating classes. This gives us a 10-year window to work with. The important variables in this dataset are year, college, Category (major), and Bachelor's (a percentage of the graduating class). The major percentages may be proxies for the overall desirability of the major.

To clean this dataset, we had to download all the `.pdf` files of the CDS from each school. We then created and ran a PDF scraper to pull the Degrees Conferred table from each PDF; there were a few cases where this had to be done manually because of inconsistent formatting or the lack of a downloadable PDF. Then, to combine the datasets, equivalent columns were combined, leading to the overall dataset.
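Some Degrees Conferred tables continue onto a second page, where the first extracted row is data and not a header. A minimal sketch of stitching such a table back together is shown below, assuming each table is a list of rows as a PDF table extractor might return them; `clean_header` and `stitch_tables` are hypothetical helpers and differ from the scraper code later in this notebook.

```python
from typing import List, Optional, Sequence, Tuple

Row = List[Optional[str]]


def clean_header(header: Sequence[Optional[str]]) -> List[str]:
    """Replace line breaks inside header cells and trim whitespace."""
    return [(cell or "").replace("\n", " ").strip() for cell in header]


def stitch_tables(
    first_page: Sequence[Sequence[Optional[str]]],
    continuation: Sequence[Sequence[Optional[str]]],
) -> Tuple[List[str], List[Row]]:
    """Join a table whose header is on the first page with its headerless continuation."""
    if not first_page:
        raise ValueError("first_page must contain at least a header row")
    header = clean_header(first_page[0])
    width = len(header)
    rows: List[Row] = [list(row) for row in first_page[1:]]
    for row in continuation:
        # Pad short rows and trim long ones so every row matches the header width.
        rows.append((list(row) + [None] * width)[:width])
    return header, rows


first = [["Category", "Bachelor's\ndegrees"], ["Education", "3.1"]]
second = [["Engineering", "20.4"], ["Visual and performing arts"]]
header, rows = stitch_tables(first, second)
print(header)
print(rows)
```

Treating every continuation row as data avoids losing the first row of the second page to the header.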

##### Combining
To combine these datasets, it's necessary to line up majors at colleges with jobs. This means we will probably use a classifier to fit the jobs in 'OCC_TITLE' to the majors in 'Category'. Then, we can graph the relationships between employment rates and major trends.
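Before committing to a classifier, a simple baseline could help check how far plain word overlap gets us. The sketch below scores each occupation title against each major with Jaccard similarity over word sets; this is only a stand-in for the eventual classifier, and the stopword list, the `threshold` value, and the function names are all assumptions made for illustration.

```python
import re
from typing import Iterable, Optional, Set

# Assumption: a small hand-picked stopword list; a real one would need tuning.
STOPWORDS = {"and", "of", "the", "related", "all", "other", "except"}


def tokens(text: str) -> Set[str]:
    """Lowercase word set with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_major(occ_title: str, majors: Iterable[str], threshold: float = 0.2) -> Optional[str]:
    """Return the major with the highest word overlap, or None if below threshold."""
    occ_tokens = tokens(occ_title)
    best, best_score = None, 0.0
    for major in majors:
        score = jaccard(occ_tokens, tokens(major))
        if score > best_score:
            best, best_score = major, score
    return best if best_score >= threshold else None


# Illustrative category names only, not read from our dataset.
majors = ["Computer and information sciences", "Engineering", "Education"]
print(best_major("Computer and Information Research Scientists", majors))
print(best_major("Chief Executives", majors))
```

Exact word overlap misses pairs like "Engineers" and "Engineering", so stemming or a trained text classifier would likely be needed for usable coverage.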

## BLS OEWS

In [None]:
# Import necessary libraries

import os
import pandas as pd

In [None]:
# Changes all xls files into csv files

def xls_x_to_csv(file_path, file_type):
    # Read the .xls file
    df = pd.read_excel(file_path)
    # Define the .csv file path
    csv_file_path = file_path.replace(file_type, '.csv')
    # Save the dataframe to a .csv file
    df.to_csv(csv_file_path, index=False)

def all_xls_x_to_csv(bls_data_path):
    # Iterate over each year folder in the bls_data folder
    for year_folder in os.listdir(bls_data_path):
        if year_folder == "full_processed_csvs" or year_folder == "useful_csvs":
            continue
        year_path = os.path.join(bls_data_path, year_folder)
        if os.path.isdir(year_path):
            # Iterate over each .xls file in the year folder
            for file in os.listdir(year_path):
                if file.endswith('.xls'):
                    file_path = os.path.join(year_path, file)
                    xls_x_to_csv(file_path, '.xls')
                if file.endswith('.xlsx'):
                    file_path = os.path.join(year_path, file)
                    xls_x_to_csv(file_path, '.xlsx')
                    
# Uncomment the line below to change all xls and xlsx files into csv files
# all_xls_x_to_csv('bls_data')

In [None]:
# Read through each year of csvs, and get only relevant areas into one csv
relevant_areas = ["Los Angeles-Long Beach-Santa Ana, CA", 
"Los Angeles-Long Beach-Anaheim, CA", # shifts classification in 2015
"Western New Hampshire nonmetropolitan area",
"Southwestern New Hampshire nonmetropolitan area",
"West Central New Hampshire nonmetropolitan area",
"Southwest New Hampshire nonmetropolitan area",
"West Central-Southwest New Hampshire nonmetropolitan area",
"San Diego-Carlsbad-San Marcos, CA",
"San Diego-Carlsbad, CA", # shifts classification in 2015
"Ann Arbor, MI",
"Washington-Arlington-Alexandria, DC-VA-MD-WV Metropolitan Division",
"Spokane, WA",
"Spokane-Spokane Valley, WA", # shifts classification in 2015
"Winston-Salem, NC",
"Utica-Rome, NY",
]

# All the relevant New Hampshire areas:
# "Other New Hampshire nonmetropolitan area", - this means central - do not use
# "Western New Hampshire nonmetropolitan area", - use
# "Southwestern New Hampshire nonmetropolitan area", - use
# "West Central New Hampshire nonmetropolitan area", - use - 2015
# "Central New Hampshire nonmetropolitan area", - do not use - 2015
# "Southwest New Hampshire nonmetropolitan area", - use - 2015
# "West Central-Southwest New Hampshire nonmetropolitan area", - 2018 combination of west central and southwest - only choice aside from central in 2018 onwards
# overall, we'll use West Central-Southwest, while avoiding Central?

def process_year(bls_data_path, year_path, relevant_areas, column_name, year_number):
    # Create the output directory if it doesn't exist
    processed_folder = os.path.join(bls_data_path, "full_processed_csvs")
    os.makedirs(processed_folder, exist_ok=True)

    result_df = pd.DataFrame()

    # Iterate through all CSV files in the year folder
    for file in os.listdir(year_path):
        if file.endswith('.csv'):
            file_path = os.path.join(year_path, file)
            df = pd.read_csv(file_path)
            if column_name in df.columns:
                # Filter rows where the column value is in relevant_areas
                filtered_df = df[df[column_name].isin(relevant_areas)]
                result_df = pd.concat([result_df, filtered_df], ignore_index=True)

    # Define the output file path
    output_file_path = os.path.join(processed_folder, f"{year_number}_relevant.csv")

    # Write the filtered DataFrame to the output file
    result_df.to_csv(output_file_path, index=False)
    print(f"Processed data saved to {output_file_path}")

def process_all_years(bls_data_path, relevant_areas):
    # Iterate over each year folder in the bls_data folder
    for year_folder in os.listdir(bls_data_path):
        # Skip output folders and stray files such as .DS_Store; year folders are named like "2006"
        if not year_folder.isdigit():
            continue
        year_number = int(year_folder)
        year_path = os.path.join(bls_data_path, year_folder)
        column_name = ''
        if os.path.isdir(year_path):
            # AREA_NAME for 2006-2018, area_title in 2019, AREA_TITLE from 2020 onwards
            if year_number < 2019:
                column_name = 'AREA_NAME'
            elif year_number == 2019:
                column_name = 'area_title'
            elif year_number > 2019:
                column_name = 'AREA_TITLE'
            else:
                print("Error: invalid year")
                exit(1)
            process_year(bls_data_path, year_path, relevant_areas, column_name, year_number)

process_all_years('bls_data', relevant_areas)

In [None]:
# Format all area names and column names to be consistent, more cleaning done in next step

def format_area_names(df):
    if 'AREA_NAME' in df.columns:
        df.rename(columns={'AREA_NAME': 'AREA_TITLE'}, inplace=True)
    df['AREA_TITLE'] = df['AREA_TITLE'].replace("Los Angeles-Long Beach-Santa Ana, CA", "Los Angeles-Long Beach-Anaheim, CA")
    df['AREA_TITLE'] = df['AREA_TITLE'].replace("San Diego-Carlsbad-San Marcos, CA", "San Diego-Carlsbad, CA")
    df['AREA_TITLE'] = df['AREA_TITLE'].replace("Spokane, WA", "Spokane-Spokane Valley, WA")
    df['AREA_TITLE'] = df['AREA_TITLE'].replace("Western New Hampshire nonmetropolitan area", "West Central-Southwest New Hampshire nonmetropolitan area")
    df['AREA_TITLE'] = df['AREA_TITLE'].replace("Southwestern New Hampshire nonmetropolitan area", "West Central-Southwest New Hampshire nonmetropolitan area")
    df['AREA_TITLE'] = df['AREA_TITLE'].replace("West Central New Hampshire nonmetropolitan area", "West Central-Southwest New Hampshire nonmetropolitan area")
    df['AREA_TITLE'] = df['AREA_TITLE'].replace("Southwest New Hampshire nonmetropolitan area", "West Central-Southwest New Hampshire nonmetropolitan area")
    if 'GROUP' in df.columns:
        df.rename(columns={'GROUP': 'OCC_GROUP'}, inplace=True)
    elif 'O_GROUP' in df.columns:
        df.rename(columns={'O_GROUP': 'OCC_GROUP'}, inplace=True)
    if 'LOC QUOTIENT' in df.columns:
        df.rename(columns={'LOC QUOTIENT': 'LOC_QUOTIENT'}, inplace=True)
    if 'LOC_QUOTIENT' not in df.columns:
        df['LOC_QUOTIENT'] = pd.NA
    
    return df

def format_column_names(df):
    # Upper-case every column name in one step
    df.columns = df.columns.str.upper()
    return df

def format_year(file_path):
    df = pd.read_csv(file_path)
    df = format_column_names(df)
    df = format_area_names(df)
    df.to_csv(file_path, index=False)

def format_all_years(processed_csv_path):
    for file in os.listdir(processed_csv_path):
        if file.endswith('.csv'):
            file_path = os.path.join(processed_csv_path, file)
            format_year(file_path)

format_all_years('bls_data/full_processed_csvs')

In [None]:
# Finish cleaning data by dropping useless columns

def drop_useless_columns(df):
    # Drop columns that are not needed
    drop_list = ['AREA_TYPE', 'NAICS', 'NAICS_TITLE', 'I_GROUP', 'OWN_CODE', 'OCC_CODE', 'PCT_TOTAL', 'PCT_RPT', 'PRIM_STATE', 'AREA']
    # OCC_CODE is the same as OCC_TITLE, PCT_TOTAL and PCT_RPT is always NaN, PRIM_STATE is in AREA_TITLE, AREA is same as AREA_TITLE, others cause issues
    for column in drop_list:
        if column in df.columns:
            df.drop(columns=[column], inplace=True)
    # Columns that are entirely NaN (e.g. a placeholder LOC_QUOTIENT) are kept so every year has the same columns
    return df

def drop_all_useless_columns(processed_csv_path):
    os.makedirs('bls_data/useful_csvs', exist_ok=True)  # make sure the output folder exists
    for file in os.listdir(processed_csv_path):
        if file.endswith('.csv'):
            file_path = os.path.join(processed_csv_path, file)
            df = pd.read_csv(file_path)
            out_df = drop_useless_columns(df)
            out_df.to_csv(os.path.join('bls_data/useful_csvs', file), index=False)

drop_all_useless_columns('bls_data/full_processed_csvs')

In [None]:
# Test to check if all years have the same columns and if all areas are represented

def test_csvs(useful_csv_path):
    reference_columns = None
    reference_area_titles = None
    for file in os.listdir(useful_csv_path):
        # Skip the combined file, which has an extra YEAR column
        if file.endswith('.csv') and file != 'all_years.csv':
            file_path = os.path.join(useful_csv_path, file)
            df = pd.read_csv(file_path)
            # The first file read serves as the reference for all the others
            if reference_columns is None:
                reference_columns = set(df.columns)
            elif set(df.columns) != reference_columns:
                print(f"Warning: Columns in {file} do not match the reference columns.")
            if reference_area_titles is None:
                reference_area_titles = set(df['AREA_TITLE'].unique())
            elif set(df['AREA_TITLE'].unique()) != reference_area_titles:
                print(f"Warning: Areas in {file} do not match the reference areas.")
            unique_area_titles = df['AREA_TITLE'].nunique()
            if unique_area_titles != 8:
                print(f"Warning: {file} does not have all 8 areas represented.")

test_csvs('bls_data/useful_csvs')

In [None]:
# Combine all years into one csv

def combine_years(useful_csv_path):
    combined_df = pd.DataFrame()
    for file in os.listdir(useful_csv_path):
        if file.endswith('.csv') and file != 'all_years.csv':
            file_path = os.path.join(useful_csv_path, file)
            df = pd.read_csv(file_path)
            year = int(file.split('_')[0])
            df['YEAR'] = year
            combined_df = pd.concat([combined_df, df], ignore_index=True)
    # All-NaN columns are kept: the column selection below expects every listed column to exist
    column_order = ['YEAR', 'AREA_TITLE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP', 'EMP_PRSE', 'JOBS_1000', 'LOC_QUOTIENT', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE', 
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10', 'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'ANNUAL', 'HOURLY']
    combined_df = combined_df[column_order]
    combined_df = combined_df.sort_values(by=['YEAR', 'AREA_TITLE'], ascending=[True, True], kind='mergesort')
    combined_df.to_csv('bls_data/useful_csvs/all_years.csv', index=False)

combine_years('bls_data/useful_csvs')

In [None]:
df = pd.read_csv('bls_data/useful_csvs/all_years.csv')
df.shape

## CDS

In [None]:
# import libraries

import pdfplumber
import pandas as pd
import numpy as np
import os

In [None]:
# Pull degrees conferred tables from CDS PDFs

def extract_table_from_pdf(pdf_path, pdf_name, full_processed_csvs_folder):
    df_results = pd.DataFrame()
    print(f"Processing file {pdf_name}")
    with pdfplumber.open(pdf_path) as pdf:
        # Iterate over the pages in the PDF
        for page_num, page in enumerate(pdf.pages, start=1):
            # Extract text content from the page
            # Fall back to an empty string so pages without extractable text do not break .lower()
            text = page.extract_text() or ""
            # Check if "degrees conferred" exists in the page text (case insensitive)
            if "degrees conferred" in text.lower():
                print(f"Found 'degrees conferred' on page {page_num}. Extracting table...")
                tables = page.extract_tables()
                for table in tables:
                    df = pd.DataFrame(table[1:], columns=table[0])
                    df.columns = df.columns.str.replace('\n', ' ', regex=True)
                    df_results = pd.concat([df_results, df], ignore_index=True)
                special_cases = ["ucsd_2023", "wsu_2023"]
                extra_special_cases = ["caltech_2021", "caltech_2022", "caltech_2023", "asu_2021", "asu_2022", "asu_2023"]
                if pdf_name in special_cases : # handle special case where table is split across two pages on a case-by-case basis
                    next_page = pdf.pages[page_num]  # page_num is 1-based, so this index is the following page
                    next_tables = next_page.extract_tables()
                    if next_tables:
                        for table in next_tables:
                            df = pd.DataFrame(table[1:], columns=table[0])
                            top_row = df.columns
                            df.columns = df_results.columns
                            df.loc[-1] = top_row 
                            df.index = df.index + 1
                            df = df.sort_index()
                            df_results = pd.concat([df_results, df], ignore_index=True)
                if pdf_name in extra_special_cases : # handle special case where table is split across two pages and the column names are on the second page
                    next_page = pdf.pages[page_num]
                    next_tables = next_page.extract_tables()
                    if next_tables:
                        for table in next_tables:
                            df = pd.DataFrame(table[1:], columns=table[0])
                            df_results = pd.concat([df_results, df], ignore_index=True)
                break
        if df_results.empty:
            print(f"'degrees conferred' table not found in {pdf_name}.")

    if not df_results.empty:
        df_results = df_results.reset_index(drop=True)
        year_directory = os.path.join(full_processed_csvs_folder, pdf_name.split("_")[1])
        os.makedirs(year_directory, exist_ok=True)
        file_name = os.path.join(year_directory, f"{pdf_name}_degrees_conferred.csv")
        df_results.to_csv(file_name, index=False)
        print(f"Table saved to {file_name}.")

def extract_all_tables(cds_data_folder, full_processed_csvs_folder):
    for college_folder in os.listdir(cds_data_folder):
        if college_folder in {"full_processed_csvs", "useful_csvs", ".DS_Store", "manual"}:
            continue
        college_folder_path = os.path.join(cds_data_folder, college_folder)
        for pdf_file in os.listdir(college_folder_path):
            if pdf_file.endswith('.pdf'):
                pdf_file_path = os.path.join(college_folder_path, pdf_file)
                extract_table_from_pdf(pdf_file_path, pdf_file.split(".")[0], full_processed_csvs_folder)

extract_all_tables("cds_data", "cds_data/full_processed_csvs")

In [None]:
# Clean datasets by merging columns with similar names and combining data from different years

def merge_column_names(file_path):
    df = pd.read_csv(file_path)
    if "Diploma/ Certificates" in df.columns:
        df = df.rename(columns={"Diploma/ Certificates": "Diploma/Certificates"})
    elif "Diplomas / Certificates" in df.columns:
        df = df.rename(columns={"Diplomas / Certificates": "Diploma/Certificates"})
    elif "\"Diploma/\nCertificates\"" in df.columns:
        df = df.rename(columns={"\"Diploma/\nCertificates\"": "Diploma/Certificates"})
    if "CIP 2010 Categories to Include" in df.columns:
        df = df.rename(columns={"CIP 2010 Categories to Include": "CIP Code"})
    elif "CIP Code Number" in df.columns:
        df = df.rename(columns={"CIP Code Number": "CIP Code"})
    elif "CIP 2020 Categories to Include" in df.columns:
        df = df.rename(columns={"CIP 2020 Categories to Include": "CIP Code"})
    elif "CIP 2021 Categories to Include" in df.columns:
        df = df.rename(columns={"CIP 2021 Categories to Include": "CIP Code"})
    elif "CIP202 Categories to Include" in df.columns:
        df = df.rename(columns={"CIP202 Categories to Include": "CIP Code"})
    elif "\"CIP202\nCategories\nto\nInclude\"" in df.columns:
        df = df.rename(columns={"\"CIP202\nCategories\nto\nInclude\"": "CIP Code"})
    if "Category (UM-Ann Arbor grants Bachelor's degrees; no undergraduate Diploma/Certificates or Associate degrees)" in df.columns:
        df = df.rename(columns={"Category (UM-Ann Arbor grants Bachelor's degrees; no undergraduate Diploma/Certificates or Associate degrees)": "Category"})
    if "Unnamed: 2" in df.columns:
        df = df.rename(columns={"Unnamed: 2": "Bachelor's"})
    if "Bachelor’s degrees (First majors)" in df.columns:
        df = df.drop(columns=["Bachelor’s degrees (First majors)"])
    return df
    
def merge_all_column_names(full_processed_csvs_folder):
    for year_folder in os.listdir(full_processed_csvs_folder):
        if year_folder == ".DS_Store" or year_folder == "combined":
            continue
        year_folder_path = os.path.join(full_processed_csvs_folder, year_folder)
        for college_file in os.listdir(year_folder_path):
            if college_file.endswith('.csv'):
                college_file_path = os.path.join(year_folder_path, college_file)
                df = merge_column_names(college_file_path)
                df.to_csv(college_file_path, index=False)

def combine_year_data(full_processed_csvs_folder):
    for year_folder in os.listdir(full_processed_csvs_folder):
        if year_folder == ".DS_Store" or year_folder == "combined":
            continue
        year_folder_path = os.path.join(full_processed_csvs_folder, year_folder)
        combined_df = pd.DataFrame()
        for college_file in os.listdir(year_folder_path):
            if college_file.endswith('.csv'):
                college_file_path = os.path.join(year_folder_path, college_file)
                df = pd.read_csv(college_file_path)
                df["college"] = college_file.split("_")[0]
                df["year"] = year_folder
                combined_df = pd.concat([combined_df, df], ignore_index=True)
        combined_folder = "cds_data/useful_csvs"
        if not os.path.exists(combined_folder):
            os.makedirs(combined_folder)
        combined_df.to_csv(os.path.join(combined_folder, f"{year_folder}_combined.csv"), index=False)

merge_all_column_names("cds_data/full_processed_csvs")
combine_year_data("cds_data/full_processed_csvs")

In [None]:
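# Fix UMich data: fold the straight-apostrophe "Bachelor's" column into the curly-apostrophe "Bachelor’s" column
for combined_csv in os.listdir("cds_data/useful_csvs"):
    if not combined_csv.endswith('.csv'):
        continue  # skip stray files such as .DS_Store
    df = pd.read_csv(os.path.join("cds_data/useful_csvs", combined_csv))
    df.replace("--", "", inplace=True)
    if "Bachelor's" in df.columns:
        if 'Bachelor’s' in df.columns:
            df['Bachelor’s'] = df['Bachelor’s'].combine_first(df["Bachelor's"])
            df = df.drop(columns=["Bachelor's"])
        else:
            # Only the straight-apostrophe column exists, so just rename it
            df = df.rename(columns={"Bachelor's": 'Bachelor’s'})
    df.to_csv(os.path.join("cds_data/useful_csvs", combined_csv), index=False)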

In [None]:
# Combine data from different years into a single dataset

def combine_years(useful_csv_path):
    combined_df = pd.DataFrame()
    for file in os.listdir(useful_csv_path):
        if file.endswith('.csv') and file != 'all_years.csv':
            file_path = os.path.join(useful_csv_path, file)
            df = pd.read_csv(file_path)
            combined_df = pd.concat([combined_df, df], ignore_index=True)
    # All-NaN columns are kept: the column selection below expects every listed column to exist
    column_order = ['year', 'college', 'CIP Code', 'Category', 'Bachelor’s', 'Associate', 'Diploma/Certificates']
    combined_df = combined_df[column_order]
    combined_df = combined_df.sort_values(by=['year', 'college'], ascending=[True, True], kind='mergesort')
    combined_df.to_csv('cds_data/useful_csvs/all_years.csv', index=False)

combine_years('cds_data/useful_csvs')

In [None]:
df = pd.read_csv('cds_data/useful_csvs/all_years.csv')
df.shape

# Ethics & Privacy

- Thoughtful discussion of ethical concerns included
- Ethical concerns consider the whole data science process (question asked, data collected, data being used, the bias in data, analysis, post-analysis, etc.)
- How your group handled bias/ethical concerns clearly described

Acknowledge and address any ethics & privacy related issues of your question(s), proposed dataset(s), and/or analyses. Use the information provided in lecture to guide your group discussion and thinking. If you need further guidance, check out [Deon's Ethics Checklist](http://deon.drivendata.org/#data-science-ethics-checklist). In particular:

- Are there any biases/privacy/terms of use issues with the data you proposed?
- Are there potential biases in your dataset(s), in terms of who is represented and how the data was collected, that may be problematic for equitable analysis? (For example, does your data exclude particular populations, or is it likely to reflect particular human biases in a way that could be a problem?)
- How will you set out to detect these specific biases before, during, and after/when communicating your analysis?
- Are there any other issues related to your topic area, data, and/or analyses that are potentially problematic in terms of data privacy and equitable impact?
- How will you handle issues you identified?

# Team Expectations 


Read over the [COGS108 Team Policies](https://github.com/COGS108/Projects/blob/master/COGS108_TeamPolicies.md) individually. Then, include your group’s expectations of one another for successful completion of your COGS108 project below. Discuss and agree on what all of your expectations are. Discuss how your team will communicate throughout the quarter and consider how you will communicate respectfully should conflicts arise. By including each member’s name above and by adding their name to the submission, you are indicating that you have read the COGS108 Team Policies, accept your team’s expectations below, and have every intention to fulfill them. These expectations are for your team’s use and benefit — they won’t be graded for their details.

* *Team Expectation 1*
* *Team Expectation 2*
* *Team Expectation 3*
* ...

# Project Timeline Proposal

Specify your team's specific project timeline. An example timeline has been provided. Change the dates, times, names, and details to fit your group's plan.

If you think you will need any special resources or training outside what we have covered in COGS 108 to solve your problem, then your proposal should state these clearly. For example, if you have selected a problem that involves implementing multiple neural networks, please state this so we can make sure you know what you’re doing and so we can point you to resources you will need to implement your project. Note that you are not required to use outside methods.



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/20  |  1 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 1/26  |  10 AM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/1  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/14  | 6 PM  | Import & Wrangle Data (Ant Man); EDA (Hulk) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 12 PM  | Finalize wrangling/EDA; Begin Analysis (Iron Man; Thor) | Discuss/edit Analysis; Complete project check-in |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Wasp)| Discuss/edit full project |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |