# COGS 108 - Final Project (change this to your project's title)

# Permissions

Place an `X` in the appropriate bracket below to specify if you would like your group's project to be made available to the public. (Note that student names will be included, but PIDs will be scraped from any groups who include their PIDs.)

* [ X ] YES - make available
* [  ] NO - keep private

# Name

- Kaylie Mendoza

# Abstract

Please write one to four paragraphs that give a very brief overview of why you did this, how you did it, and the major findings and conclusions.

# Research Question

-  Include a specific, clear data science question.
-  Make sure what you're measuring (variables) to answer the question is clear

What is your research question? Include the specific question you're setting out to answer. This question should be specific, answerable with data, and clear. A general question with specific subquestions is permitted. (1-2 sentences)



## Background and Prior Work


- Include a general introduction to your topic
- Include explanation of what work has been done previously
- Include citations or links to previous work

This section will present the background and context of your topic and question in a few paragraphs. Include a general introduction to your topic and then describe what information you currently know about the topic after doing your initial research. Include references to other projects that have asked similar questions or approached similar problems. Explain what others have learned in their projects.

Find some relevant prior work, and reference those sources, summarizing what each did and what they learned. Even if you think you have a totally novel question, find the most similar prior work that you can and discuss how it relates to your project.

References can be research publications, but they need not be. Blogs, GitHub repositories, company websites, etc., are all viable references if they are relevant to your project. It must be clear which information comes from which references. (2-3 paragraphs, including at least 2 references)

  **Use inline citation through HTML footnotes to specify which references support which statements** 

For example: After government genocide in the 20th century, real birds were replaced with surveillance drones designed to look just like birds.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Use a minimum of 2 or 3 citations, but we prefer more.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) You need enough to fully explain and back up important facts. 

Note that if you click a footnote number in the paragraph above it will transport you to the proper entry in the footnotes list below.  And if you click the ^ in the footnote entry, it will return you to the place in the main text where the footnote is made.

To understand the HTML here: `<a name="..."></a>` is a tag that lets you create a named reference at a given location.  Markdown has the construction `[text with hyperlink](#named reference)` that produces a clickable link that takes you to the named reference.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Lorenz, T. (9 Dec 2021) Birds Aren’t Real, or Are They? Inside a Gen Z Conspiracy Theory. *The New York Times*. https://www.nytimes.com/2021/12/09/technology/birds-arent-real-gen-z-misinformation.html 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Also refs should be important to the background, not some randomly chosen vaguely related stuff. Include a web link if possible in refs as above.


# Hypothesis



- Include your team's hypothesis
- Ensure that this hypothesis is clear to readers
- Explain why you think this will be the outcome (what was your thinking?)

What is your main hypothesis/predictions about what the answer to your question is? Briefly explain your thinking. (2-3 sentences)

# Data

## Data overview

For each dataset include the following information
- Dataset #1
  - Dataset Name: congress-legislators/executive.yaml
  - Link to the dataset: https://github.com/unitedstates/congress-legislators/blob/main/executive.yaml
  - Number of observations: 80
  - Number of variables: 4

The `executive.yaml` dataset contains 80 observations and 4 variables, providing information about U.S. presidents and vice presidents. Key variables include `name` (details about the individual's name), `bio` (biographical details such as birthdate and gender), and `terms` (a list of terms served, including start and end dates, party affiliation, and how they assumed office). The dataset is in a nested YAML format, requiring preprocessing to flatten the structure into a tabular format. This preprocessing involved parsing nested fields, expanding term records, and calculating additional metrics such as term durations, ages, and zodiac signs for further analysis.
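The flattening step described above can be sketched on a small hand-written record. The sample below is hypothetical (its values are illustrative and not copied from `executive.yaml`) and assumes each entry carries `name`, `bio`, and `terms` keys as described. `pd.json_normalize` with `record_path` and `meta` is one possible way to get one row per term; it is not necessarily the approach used in the cells that follow.

```python
import pandas as pd

# Hypothetical record mirroring the nested layout described above
# (illustrative values, not taken from executive.yaml).
records = [
    {
        "name": {"first": "Jane", "last": "Doe"},
        "bio": {"birthday": "1950-06-15", "gender": "F"},
        "terms": [
            {"type": "viceprez", "start": "2001-01-20", "end": "2005-01-20",
             "party": "Example", "how": "election"},
            {"type": "prez", "start": "2005-01-20", "end": "2009-01-20",
             "party": "Example", "how": "election"},
        ],
    }
]

# One output row per entry in `terms`; the nested name/bio fields are
# repeated on every row as dotted column names such as `name.first`.
flat = pd.json_normalize(
    records,
    record_path="terms",
    meta=[["name", "first"], ["name", "last"],
          ["bio", "birthday"], ["bio", "gender"]],
)
print(flat.columns.tolist())
```

Each person contributes as many rows as they have terms, which is the shape the later age and duration calculations expect.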

## U.S. Presidents and Vice Presidents Dataset

In [1]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION 

In [2]:
# Imports

# Data manipulation and analysis
import pandas as pd

# Data parsing
import yaml
import ast

# Date calculations
from dateutil.relativedelta import relativedelta

In [3]:
# Initial Data Loading and YAML to CSV Conversion

# Set up input and output paths
executive_yaml = 'congress-legislators/executive.yaml'
executive_csv = 'executive.csv'

# Load YAML and convert to DataFrame
with open(executive_yaml, 'r') as f:
    df = pd.DataFrame(yaml.safe_load(f))

# Save to CSV
df.to_csv(executive_csv, index=False)

FileNotFoundError: [Errno 2] No such file or directory: 'congress-legislators/executive.yaml'
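The traceback above suggests the relative path did not resolve from the notebook's working directory. A minimal sketch of a path check that fails with a more descriptive message is shown below. The helper name `find_first_existing` and the candidate locations in the usage comment are assumptions for illustration only.

```python
from pathlib import Path


def find_first_existing(candidates):
    """Return the first candidate path that is an existing file.

    Raises FileNotFoundError listing every path that was tried.
    """
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    tried = ", ".join(str(c) for c in candidates)
    raise FileNotFoundError(f"None of these paths exist: {tried}")


# Hypothetical usage; adjust the candidates to wherever the repository was cloned:
# executive_yaml = find_first_existing([
#     "congress-legislators/executive.yaml",
#     "../congress-legislators/executive.yaml",
# ])
```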

In [None]:
# Load in the U.S. Presidents and Vice Presidents Dataset
executive = pd.read_csv('executive.csv')

In [None]:
# Parse the executive DataFrame
# The CSV stores each nested field as the string form of a Python literal,
# so ast.literal_eval is used to parse it without executing arbitrary code.

# 1. Extract name fields
names = pd.json_normalize(executive['name'].apply(ast.literal_eval))

# 2. Extract bio fields
bios = pd.json_normalize(executive['bio'].apply(ast.literal_eval))
bios = bios[['birthday', 'gender']]

# 3. Expand terms
def expand_term(row):
    terms = ast.literal_eval(row['terms'])
    expanded = pd.json_normalize(terms)
    # reindex keeps the column order and fills any field missing from a term with NaN
    expanded = expanded.reindex(columns=['type', 'party', 'start', 'end', 'how'])
    
    # Add name and bio information
    for col in names.columns:
        expanded[col] = names.loc[row.name, col]
    for col in bios.columns:
        expanded[col] = bios.loc[row.name, col]
    
    return expanded

# Process each row's 'terms' column data into a single DataFrame
expanded_terms = [expand_term(row) for _, row in executive.iterrows()]
executive = pd.concat(expanded_terms, ignore_index=True)

# Remove duplicate columns and original nested data columns
columns_to_drop = ['id', 'name', 'bio', 'terms']
executive = executive.loc[:, ~executive.columns.duplicated()]
executive = executive.drop(columns=[col for col in columns_to_drop if col in executive.columns])

In [None]:
# Convert date strings to datetime.date objects

date_columns = {
    'birthday': 'birthdate',
    'start': 'start_term',
    'end': 'end_term'
}

for old_col, new_col in date_columns.items():
    executive[new_col] = pd.to_datetime(executive[old_col], errors='coerce').dt.date
    executive = executive.drop(columns=[old_col])

In [None]:
# Create full names by combining name components
executive['full_name'] = executive.apply(
    lambda row: ' '.join(filter(pd.notnull, [
        row['first'],
        f'"{row["nickname"]}"' if pd.notnull(row['nickname']) else None,
        row['middle'],
        row['last'],
        row['suffix']
    ])) or None,
    axis=1)

In [None]:
# Create a 'birthday' column from the 'birthdate' column by extracting month-day
executive['birthday'] = executive['birthdate'].apply(lambda x: x.strftime('%m-%d') if pd.notnull(x) else None)

In [None]:
# Define zodiac signs and their date ranges
zodiac_ranges = [
    ("Capricorn", [(12, 22, 12, 31), (1, 1, 1, 19)]),
    ("Aquarius", [(1, 20, 2, 18)]),
    ("Pisces", [(2, 19, 3, 20)]),
    ("Aries", [(3, 21, 4, 19)]),
    ("Taurus", [(4, 20, 5, 20)]),
    ("Gemini", [(5, 21, 6, 20)]),
    ("Cancer", [(6, 21, 7, 22)]),
    ("Leo", [(7, 23, 8, 22)]),
    ("Virgo", [(8, 23, 9, 22)]),
    ("Libra", [(9, 23, 10, 22)]),
    ("Scorpio", [(10, 23, 11, 21)]),
    ("Sagittarius", [(11, 22, 12, 21)])
]

# Check whether (month, day) falls inside one (start_m, start_d, end_m, end_d) range
def in_date_range(month, day, range_tuple):
    start_m, start_d, end_m, end_d = range_tuple
    return (start_m, start_d) <= (month, day) <= (end_m, end_d)

# Get zodiac sign from an 'MM-DD' birthday string (None if the birthday is missing)
def get_zodiac_sign(birthday):
    if pd.isnull(birthday) or not birthday:
        return None
    month, day = map(int, birthday.split('-'))
    for sign, ranges in zodiac_ranges:
        if any(in_date_range(month, day, r) for r in ranges):
            return sign
    return None

executive['zodiac_sign'] = executive['birthday'].apply(get_zodiac_sign)
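The sign boundaries are easy to get wrong by one day, so a standalone spot check can help. The sketch below copies the date ranges from the cell above into a flat list and looks up a few boundary dates. `ZODIAC_RANGES` and `zodiac_sign_for` are hypothetical names introduced only for this check.

```python
# (sign, (start_month, start_day), (end_month, end_day)); Capricorn wraps the
# year end, so it appears twice. Ranges copied from the notebook cell above.
ZODIAC_RANGES = [
    ("Capricorn", (12, 22), (12, 31)),
    ("Capricorn", (1, 1), (1, 19)),
    ("Aquarius", (1, 20), (2, 18)),
    ("Pisces", (2, 19), (3, 20)),
    ("Aries", (3, 21), (4, 19)),
    ("Taurus", (4, 20), (5, 20)),
    ("Gemini", (5, 21), (6, 20)),
    ("Cancer", (6, 21), (7, 22)),
    ("Leo", (7, 23), (8, 22)),
    ("Virgo", (8, 23), (9, 22)),
    ("Libra", (9, 23), (10, 22)),
    ("Scorpio", (10, 23), (11, 21)),
    ("Sagittarius", (11, 22), (12, 21)),
]


def zodiac_sign_for(month_day):
    """Map an 'MM-DD' string to a sign name, or None if the input is missing."""
    if not month_day:
        return None
    month, day = map(int, month_day.split("-"))
    for sign, start, end in ZODIAC_RANGES:
        # Tuple comparison orders by month first, then by day
        if start <= (month, day) <= end:
            return sign
    return None


# Spot-check the dates on either side of a boundary, plus a leap day
for sample in ("01-19", "01-20", "12-21", "12-22", "02-29"):
    print(sample, zodiac_sign_for(sample))
```

Comparing `(month, day)` tuples avoids building full dates, so the lookup does not depend on the birth year.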

In [None]:
# zodiac sign color mapping
# this dictionary maps each zodiac sign to a specific color (hex code).
# colors are chosen to represent traits commonly associated with each sign.

zodiac_colors = {
    'Aries': '#FF0000',       # red for Aries (bold and energetic)
    'Taurus': '#008000',      # green for Taurus (grounded and earthy)
    'Gemini': '#FFFF00',      # yellow for Gemini (bright and lively)
    'Cancer': '#00008B',      # dark blue for Cancer (deep and emotional)
    'Leo': '#FFA500',         # orange for Leo (warm and vibrant)
    'Virgo': '#A52A2A',       # brown for Virgo (practical and grounded)
    'Libra': '#FFB6C1',       # light pink for Libra (harmonious and gentle)
    'Scorpio': '#000000',     # black for Scorpio (mysterious and intense)
    'Sagittarius': '#800080', # purple for Sagittarius (adventurous and wise)
    'Capricorn': '#556B2F',   # olive green for Capricorn (disciplined and stable)
    'Aquarius': '#0000FF',    # blue for Aquarius (innovative and free-spirited)
    'Pisces': '#40E0D0',      # turquoise for Pisces (dreamy and intuitive)
    'Unknown': '#D3D3D3'      # light gray for unknown or missing zodiac signs
}

# Add a new column for zodiac colors
executive['zodiac_color'] = executive['zodiac_sign'].map(lambda x: zodiac_colors.get(x, zodiac_colors['Unknown']))

In [None]:
# zodiac elements mapping
zodiac_elements = {
    'Aries': 'Fire',
    'Taurus': 'Earth',
    'Gemini': 'Air',
    'Cancer': 'Water',
    'Leo': 'Fire',
    'Virgo': 'Earth',
    'Libra': 'Air',
    'Scorpio': 'Water',
    'Sagittarius': 'Fire',
    'Capricorn': 'Earth',
    'Aquarius': 'Air',
    'Pisces': 'Water'
}

# Apply the zodiac element mapping directly to the column
executive['zodiac_element'] = executive['zodiac_sign'].map(zodiac_elements)

In [None]:
# zodiac_element color mapping
# this dictionary maps each zodiac element to a specific color (hex code).
# colors are chosen to represent traits commonly associated with each element.
zodiac_element_colors = {
    'Fire': '#FF4500',   # orange-red for Fire (passionate and energetic)
    'Earth': '#8B4513',  # saddle brown for Earth (stable and grounded)
    'Air': '#87CEEB',    # sky blue for Air (light and free-spirited)
    'Water': '#4682B4'   # steel blue for Water (deep and emotional)
}

# Add a new column for zodiac element colors
executive['zodiac_element_color'] = executive['zodiac_element'].map(zodiac_element_colors)

In [None]:
# Calculate ages at the start and end of terms

def calculate_term_ages(row):
    try:
        return pd.Series({
            'age_start': max(relativedelta(row['start_term'], row['birthdate']).years, 0),
            'age_end': max(relativedelta(row['end_term'], row['birthdate']).years, 0)
        })
    except Exception:  # missing or invalid dates yield no age
        return pd.Series({'age_start': None, 'age_end': None})

# Apply the function and combine with DataFrame
executive = pd.concat([executive, executive.apply(calculate_term_ages, axis=1)], axis=1)
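As a cross-check on the age columns, completed years can also be computed with the standard library alone. The sketch below is an assumption-laden illustration: `age_on` is a hypothetical helper and the dates are made up to straddle a birthday. It is not the notebook's method, which relies on `relativedelta`.

```python
from datetime import date


def age_on(birthdate, on_date):
    """Completed years between birthdate and on_date, floored at zero."""
    # Has this year's birthday already happened on `on_date`?
    had_birthday = (on_date.month, on_date.day) >= (birthdate.month, birthdate.day)
    years = on_date.year - birthdate.year - (0 if had_birthday else 1)
    return max(years, 0)


# Hypothetical dates chosen to fall before and on a birthday
born = date(1950, 6, 15)
print(age_on(born, date(2001, 1, 20)))  # birthday not yet reached in 2001
print(age_on(born, date(2001, 6, 15)))  # exactly on the birthday
```

Comparing a few rows of `age_start` against a helper like this is a cheap way to catch swapped arguments or unparsed dates.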

In [None]:
# Calculate term durations and format them

# Calculate duration metrics (days, years, formatted strings) for each term
def calculate_term_duration(row):
    try:
        # Calculate both timedelta and relativedelta for different metrics
        delta = relativedelta(row['end_term'], row['start_term'])
        days = (row['end_term'] - row['start_term']).days

        # Convert the term length to fractional years: whole calendar years
        # plus the leftover days as a share of the year they fall in, so that
        # leap days do not skew the result
        last_anniversary = row['start_term'] + relativedelta(years=delta.years)
        next_anniversary = row['start_term'] + relativedelta(years=delta.years + 1)
        year_length = (next_anniversary - last_anniversary).days
        total_years = delta.years + (row['end_term'] - last_anniversary).days / year_length
        
        return pd.Series({
            'duration_days': days,
            'total_duration_years': round(total_years, 2),
            'duration_years_months': f"{delta.years} year{'s' if delta.years != 1 else ''}, {delta.months} month{'s' if delta.months != 1 else ''}",
            'duration_years_months_days': f"{delta.years} year{'s' if delta.years != 1 else ''}, {delta.months} month{'s' if delta.months != 1 else ''}, {delta.days} day{'s' if delta.days != 1 else ''}"
        })
    except Exception as e:
        print(f"Error calculating duration: {e}")
        return pd.Series({k: None for k in ['duration_days', 'total_duration_years', 'duration_years_months', 'duration_years_months_days']})

executive = pd.concat([executive, executive.apply(calculate_term_duration, axis=1)], axis=1)
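A short arithmetic check shows why leap days matter when converting a day count to years. The sketch below uses a four-year span between two January 20 dates. `years_between` is a hypothetical helper, and the 365.25 divisor is a common approximation assumed here for illustration rather than the notebook's own formula.

```python
from datetime import date


def years_between(start, end, days_per_year=365.25):
    """Approximate fractional years between two dates using a fixed year length."""
    return (end - start).days / days_per_year


start, end = date(2017, 1, 20), date(2021, 1, 20)
days = (end - start).days

# Dividing by 365 overstates a span that contains a leap day;
# 365.25 averages the leap day over four years.
print(days)
print(round(years_between(start, end, days_per_year=365), 4))
print(round(years_between(start, end), 4))
```

For spans that are not a multiple of four years the 365.25 divisor is itself only an approximation, which is worth keeping in mind when reading `total_duration_years` to two decimals.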

In [None]:
columns_to_keep = [
    'full_name', 'gender', 'birthdate', 'birthday', 'zodiac_sign', 'zodiac_color', 'zodiac_element', 'zodiac_element_color',
    'type', 'party', 'start_term', 'end_term', 'age_start', 'age_end', 'duration_years_months_days', 'total_duration_years'
]

# Create a new DataFrame with only the specified columns
cleaned_executive = executive[columns_to_keep]

In [None]:
# Save the processed DataFrame
cleaned_executive.to_csv('cleaned_executive.csv', index=False)

# Results

## Exploratory Data Analysis

Carry out whatever EDA you need to for your project.  Because every project will be different we can't really give you much of a template at this point. But please make sure you describe the what and why in text here as well as providing interpretation of results and context.

## First Analysis You Did - Give it a better title

Some more words and stuff.  Remember notebooks work best if you interleave the code that generates a result with properly annotated figures and text that puts these results into context.

In [None]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

## Second Analysis You Did - Give it a better title

Some more words and stuff.  Remember notebooks work best if you interleave the code that generates a result with properly annotated figures and text that puts these results into context.

In [None]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

## ETC AD NAUSEAM

Some more words and stuff.  Remember notebooks work best if you interleave the code that generates a result with properly annotated figures and text that puts these results into context.

In [None]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

# Ethics & Privacy

- Thoughtful discussion of ethical concerns included
- Ethical concerns consider the whole data science process (question asked, data collected, data being used, the bias in data, analysis, post-analysis, etc.)
- How your group handled bias/ethical concerns clearly described

Acknowledge and address any ethics & privacy related issues of your question(s), proposed dataset(s), and/or analyses. Use the information provided in lecture to guide your group discussion and thinking. If you need further guidance, check out [Deon's Ethics Checklist](http://deon.drivendata.org/#data-science-ethics-checklist). In particular:

- Are there any biases/privacy/terms of use issues with the data you proposed?
- Are there potential biases in your dataset(s), in terms of who composes it and how it was collected, that may be problematic in terms of it allowing for equitable analysis? (For example, does your data exclude particular populations, or is it likely to reflect particular human biases in a way that could be a problem?)
- How will you set out to detect these specific biases before, during, and after/when communicating your analysis?
- Are there any other issues related to your topic area, data, and/or analyses that are potentially problematic in terms of data privacy and equitable impact?
- How will you handle issues you identified?

# Discussion and Conclusion

Wrap it all up here.  Somewhere between 3 and 10 paragraphs roughly.  This is a good time to refer back to your Background section and review how this work extended the previous stuff. 


# Team Contributions

Specify who did what.  This should be pretty granular, perhaps bullet points, no more than a few sentences per person.