# DS 5220 - AI-Assisted Programming in Python
> Homework 1: Python Review

**Due: Wednesday, September 13 at 11:59pm**

## Instructions
This homework is designed to assess your Python skills and your understanding of how these skills apply to data science problems. Each question generally includes multiple parts:

* **Part A**: Solve the problem using basic Python data structures.  
* **Part B**: Solve the problem using pandas.  
* **Part C**: Reflect on the solutions, considering their design, applicability, usability, maintainability, readability, etc. 
* **Part D**: Generate an alternative approach to the problem using a generative AI tool. Describe or copy/paste this approach and describe the pros/cons of its approach relative to yours. If you need help thinking through the pros/cons, you can directly ask the generative AI about the differences between your code and its code.

## Use of Generative AI
First, try Parts A, B, and C of each question on your own, with no assistance from generative AI. If needed, you can _**then**_ use generative AI on these items _**but only as a tutor: it should not directly tell you the answer, but guide you toward finding it yourself**._ Part D of each question explicitly requires generative AI. 

***

## Problem 1: Data Cleaning

* **Part A**: You have a list of dictionaries, each representing a data record with name, age, and email. Write a function that takes this list and returns a new list where:

  * Records with an invalid age (non-numeric or outside the range of 0-120) or missing email are removed.
  * Names are standardized to have the first letter capitalized.

* **Part B**: Convert the list of dictionaries to a pandas DataFrame and perform the same operation using pandas functions.

* **Part C**: How would you extend this functionality to validate email formats as well?

* **Part D**: (Follow the general Part D instructions)



In [54]:
# Synthetic dataset
sample_data = [
    {"name": "alice", "age": 30, "email": "alice@email.com"},
    {"name": "bob", "age": "not_an_age", "email": "bob@email.com"},
    {"name": "carol", "age": 150, "email": ""},
    {"name": "dave", "age": 40, "email": "dave@email.com"}
]

### Part A

In [55]:
# Your answers
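valid_data = []
for x in sample_data:
    # Age must be an integer within the inclusive range 0-120
    if isinstance(x["age"], int) and 0 <= x["age"] <= 120:
        # Email must be a non-empty string (an empty string counts as missing)
        if isinstance(x["email"], str) and x["email"]:
            x['name'] = x['name'].capitalize()
            valid_data.append(x)

print(valid_data)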

[{'name': 'Alice', 'age': 30, 'email': 'alice@email.com'}, {'name': 'Dave', 'age': 40, 'email': 'dave@email.com'}]


### Part B

In [56]:
# Your answers
import pandas as pd
df = pd.DataFrame(sample_data)

df.set_index('name',inplace=True)

age_valid = df['age'].apply(lambda x: isinstance(x, int) and 0 <= x <= 120)
email_valid = df['email'] != ''
filtered_data = df[age_valid & email_valid]
filtered_data.set_index(filtered_data.index.str.capitalize(),inplace=True)
print(filtered_data)

      age            email
name                      
Alice  30  alice@email.com
Dave   40   dave@email.com


### Part C
I would use the email-validator library, specifically its validate_email function and EmailNotValidError exception. It validates an email address and returns its normalized form.
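
To make the idea concrete without a third-party dependency, here is a minimal sketch that swaps in a simplified regular expression for the dedicated library. The pattern, the helper names `is_valid_email` and `clean_records`, and the sample records are all assumptions made for illustration; the pattern is far looser than a real validator.

```python
import re

# Simplified pattern (an assumption for illustration): exactly one "@",
# no whitespace, and at least one dot in the domain part.
# This is NOT a full RFC-compliant validator.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def is_valid_email(email):
    """Return True if email is a string matching the simplified pattern."""
    return isinstance(email, str) and EMAIL_PATTERN.fullmatch(email) is not None


def clean_records(records):
    """Keep records with a valid age and a well-formed email; capitalize names."""
    cleaned = []
    for record in records:
        age = record.get("age")
        # bool is a subclass of int, so it is excluded explicitly
        age_ok = isinstance(age, int) and not isinstance(age, bool) and 0 <= age <= 120
        if age_ok and is_valid_email(record.get("email")):
            # Build a new dict so the input records are left unchanged
            cleaned.append({**record, "name": record["name"].capitalize()})
    return cleaned


# Hypothetical records: the second one has a malformed email address
example_records = [
    {"name": "alice", "age": 30, "email": "alice@email.com"},
    {"name": "erin", "age": 25, "email": "erin@email"},
]
print(clean_records(example_records))
```

Keeping the format check in its own function means it could later be replaced by a call to a dedicated library without touching the filtering loop.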

### Part D
Your comments here

In [61]:
# Any necessary code here
# AI's code:
# Part A
sample_data = [
    {"name": "alice", "age": 30, "email": "alice@email.com"},
    {"name": "bob", "age": "not_an_age", "email": "bob@email.com"},
    {"name": "carol", "age": 150, "email": ""},
    {"name": "dave", "age": 40, "email": "dave@email.com"}
]

def clean_data(data):
    cleaned = []
    for record in data:
        age = record.get("age")
        email = record.get("email")
        name = record.get("name").capitalize()
        if isinstance(age, (int, float)) and 0 <= age <= 120 and email:
            cleaned.append({"name": name, "age": age, "email": email})
    return cleaned

cleaned_data = clean_data(sample_data)
print(cleaned_data)
#Pros of AI's Approach:
#Safety: The AI's solution uses .get() which would return None if a key is missing, preventing potential KeyErrors.
#Generality: The AI checks for both integer and float ages.

#Cons of AI's Approach:
#Additional Function: The AI introduces an extra function, which might be seen as unnecessary overhead for a simple task. Your direct approach without a function is more straightforward.

# Part B 
import pandas as pd

df = pd.DataFrame(sample_data)

# Remove rows with invalid ages
df = df[df['age'].apply(lambda x: isinstance(x, (int, float)) and 0 <= x <= 120)]

# Remove rows with empty email
df = df[df['email'] != ""]

# Standardize names
df['name'] = df['name'].str.capitalize()

print(df)

#Comparison:
#Pros of AI's Approach:
#Flexibility: The AI doesn't set the name as the DataFrame's index. This retains the original structure of the DataFrame and allows for more flexibility in further data operations.
#Efficiency: The AI's approach employs pandas' built-in functions in a more chained manner. This can be more efficient, especially with larger datasets.

#Cons of AI's Approach:
#Clarity: The AI's method of chaining operations may be less readable to some, especially those not deeply familiar with pandas. Your approach with separate validity masks is more verbose but might be clearer to some readers.
#No Index Usage: By not using indices (like you did with names), the AI's solution might not leverage potential benefits of pandas indexing, such as faster lookups. However, setting names as indices might not always be desirable, especially if names aren't unique.

[{'name': 'Alice', 'age': 30, 'email': 'alice@email.com'}, {'name': 'Dave', 'age': 40, 'email': 'dave@email.com'}]
    name age            email
0  Alice  30  alice@email.com
3   Dave  40   dave@email.com


## Problem 2: Advanced List Slicing with Named Indices
* **Part A**: You are given a list of tuples, where each tuple contains a restaurant name and its average meal cost (in dollars). Write a function that returns another list containing the average meal costs of restaurants whose names contain the word "Chicken".

* **Part B**: Convert the list of tuples into a pandas Series where the index is the restaurant name. Use pandas to perform the slicing based on restaurant names containing the word "Chicken".

* **Part C**: Reflect on how the transition from implicit indices (in lists) to explicit, named indices (in pandas Series) affects your data manipulation capabilities.

* **Part D**: (Follow the general Part D instructions)

In [63]:
# Synthetic dataset
restaurant_data = [("Chicken Palace", 25),
                   ("Hattie B's Hot Chicken", 15),
                   ("Vegan Corner", 20),
                   ("The Wild Cow", 14),
                   ("Prince's Chicken", 18),
                   ("Amerigos Italian Restaurant", 30)]



### Part A

In [7]:
# Your answers
new_list_cost_chicken = []
for index in range(len(restaurant_data)):
    if "Chicken" in restaurant_data [index][0]:
        new_list_cost_chicken.append(restaurant_data[index][1])
print(new_list_cost_chicken)



[25, 15, 18]


### Part B

In [16]:
# Your answers
import pandas as pd
list_a = []
for el in range(len(restaurant_data)):
    list_a.append(restaurant_data[el][0])

list_b = []
for el in range(len(restaurant_data)):
    list_b.append(restaurant_data[el][1])

series = pd.Series(list_b, index=list_a)
chicken_or_not = series.index.str.contains('Chicken')
filtered_series = series[chicken_or_not]
print(filtered_series)

Chicken Palace            25
Hattie B's Hot Chicken    15
Prince's Chicken          18
dtype: int64


### Part C
I think explicit indices are more reader-friendly: they reduce the cognitive load on analysts when filtering and analyzing data, which in turn reduces the likelihood of errors.  
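
A small sketch of that contrast, using a shortened, hypothetical copy of the restaurant list (the variable names are assumptions for illustration):

```python
import pandas as pd

# Shortened, hypothetical copy of the restaurant data
restaurants = [("Chicken Palace", 25), ("Vegan Corner", 20), ("Prince's Chicken", 18)]

# Implicit (positional) index: the reader has to remember what position 2 holds
cost_by_position = restaurants[2][1]

# Explicit (named) index: the label states which restaurant is being read
costs = pd.Series(dict(restaurants))
cost_by_label = costs["Prince's Chicken"]

# Filtering also works on the labels themselves, not on positions
chicken_costs = costs[costs.index.str.contains("Chicken")]

print(cost_by_position, cost_by_label)
print(chicken_costs.to_dict())
```

Both lookups return the same value, but only the second one stays correct if the order of the list changes.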

### Part D
Your comments here

In [66]:
# AI- part A
def avg_cost_chicken(restaurants):
    return [cost for name, cost in restaurants if "Chicken" in name]

chicken_costs = avg_cost_chicken(restaurant_data)
print(chicken_costs)

#Pros of AI's Approach Over Yours:
#Conciseness: AI's approach uses list comprehension, which is more concise and typically faster.
#Direct Tuple Unpacking: The AI's code directly unpacks the tuple into name and cost, making it clearer which part of the data is being checked and appended.

#Cons of AI's Approach Compared to Yours:
#Readability: List comprehensions, while concise, may be less readable to some developers compared to traditional for-loops.


# AI- part B
import pandas as pd

# Convert the list of tuples to a pandas Series
restaurant_series = pd.Series(dict(restaurant_data))

# Slice the series based on index names containing the word "Chicken"
chicken_costs_series = restaurant_series[restaurant_series.index.str.contains("Chicken")]

print(chicken_costs_series)

#Pros of AI's Approach Over Yours:
#Efficiency: The AI's method avoids creating two separate lists and uses the existing structure of the restaurant_data for Series conversion. This is more memory efficient.
#Conciseness: By chaining operations and utilizing pandas functionalities, the AI's code is more concise.
#Simpler Series Creation: The AI directly creates a pandas Series from the list of tuples, leveraging the dictionary-like structure of the data. Your approach of splitting the data first before creating the Series is an additional step.

#Cons of AI's Approach Compared to Yours:
#Potential Complexity: For those unfamiliar with pandas, the AI's chained operations might seem complex. The verbosity of your code could be beneficial for someone new to the library.

[25, 15, 18]
Chicken Palace            25
Hattie B's Hot Chicken    15
Prince's Chicken          18
dtype: int64


## Problem 3: Weather Data Aggregator
* **Part A**: You are given a list of dictionaries, where each dictionary represents weather data for a given city on a specific day. Aggregate this data to find the average, maximum, and minimum temperatures for each city.
* **Part B**: Convert the list of dictionaries to a pandas DataFrame and perform the same operation using pandas functionalities.
* **Part C**: How could you modify your program to handle missing data?
* **Part D**: (Follow the general Part D instructions)



In [68]:
# Synthetic dataset
weather_data = {
    'New York': [72, 75, 71, 73, 69, 74, 76, 70, 77, 68, 71, 73, 74, 72, 70, 71, 75, 73, 74, 76, 70, 77, 68, 72, 71, 73, 74, 76, 75, 70],
    'San Francisco': [62, 63, 65, 64, 61, 63, 65, 60, 62, 61, 64, 66, 63, 64, 65, 63, 61, 62, 66, 64, 63, 62, 64, 66, 61, 62, 65, 64, 63, 61],
    'Chicago': [68, 70, 67, 66, 71, 69, 68, 70, 67, 72, 70, 68, 69, 71, 70, 69, 68, 66, 71, 67, 69, 70, 68, 72, 71, 70, 67, 69, 68, 71],
    'Miami': [85, 86, 84, 83, 85, 84, 86, 85, 87, 83, 84, 85, 86, 87, 84, 83, 85, 86, 84, 87, 85, 86, 84, 85, 87, 86, 85, 84, 83, 87]
}


### Part A

In [40]:
# Your answers
import statistics 
for keys in weather_data:
    print(f"Average temperature for {keys} is {statistics.mean(weather_data[keys])}.") #average 
    print(f"Maximum temperature for {keys} is {max(weather_data[keys])}.") #max
    print(f"Minimum temperature for {keys} is {min(weather_data[keys])}.")#min
   


Average temperature for New York is 72.66666666666667.
Maximum temperature for New York is 77.
Minimum temperature for New York is 68.
Average temperature for San Francisco is 63.166666666666664.
Maximum temperature for San Francisco is 66.
Minimum temperature for San Francisco is 60.
Average temperature for Chicago is 69.06666666666666.
Maximum temperature for Chicago is 72.
Minimum temperature for Chicago is 66.
Average temperature for Miami is 85.03333333333333.
Maximum temperature for Miami is 87.
Minimum temperature for Miami is 83.


### Part B

In [51]:
# Your answers
import pandas as pd
df = pd.DataFrame(weather_data).T  # transpose so that each row is one city
for city in df.index:
    print(f"Maximum temperature for {city} is {df.loc[city].max()}.") # max
    print(f"Minimum temperature for {city} is {df.loc[city].min()}.") # min
    print(f"Average temperature for {city} is {df.loc[city].mean()}.") # average

Maximum temperature for New York is 77.
Minimum temperature for New York is 68.
Average temperature for New York is 72.66666666666667.
Maximum temperature for San Francisco is 66.
Minimum temperature for San Francisco is 60.
Average temperature for San Francisco is 63.166666666666664.
Maximum temperature for Chicago is 72.
Minimum temperature for Chicago is 66.
Average temperature for Chicago is 69.06666666666666.
Maximum temperature for Miami is 87.
Minimum temperature for Miami is 83.
Average temperature for Miami is 85.03333333333333.


### Part C
I would replace any missing reading for a city with that city's average temperature. 
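
A minimal sketch of this mean-imputation idea in pandas, assuming missing days are recorded as `None`. The `readings` dictionary below is hypothetical and much shorter than the homework dataset.

```python
import pandas as pd

# Hypothetical readings with gaps (None marks a missing day)
readings = {
    "New York": [72, None, 71, 73],
    "Miami": [85, 86, None, 83],
}
df = pd.DataFrame(readings)  # None becomes NaN in numeric columns

# DataFrame.mean() skips NaN by default, so each city's mean is computed
# from its observed values only; fillna then fills each column with its own mean.
filled = df.fillna(df.mean())

# Same aggregation as before, now on the gap-free data
summary = filled.agg(["mean", "max", "min"]).T
print(summary)
```

Since pandas aggregations skip NaN by default, the imputation step is optional for mean, max, and min; it matters more when later steps cannot tolerate missing values.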

### Part D
Your comments here

In [70]:
# AI for Part A
def aggregate_weather_data(data):
    aggregated_data = {}

    for city, temps in data.items():
        avg_temp = sum(temps) / len(temps)
        max_temp = max(temps)
        min_temp = min(temps)

        aggregated_data[city] = {
            "Average": avg_temp,
            "Maximum": max_temp,
            "Minimum": min_temp
        }

    return aggregated_data

weather_summary = aggregate_weather_data(weather_data)
for city, summary in weather_summary.items():
    print(city, summary)
    
# pros:
# Modularity: The AI's code defines a function (aggregate_weather_data) which returns aggregated data. This makes it reusable and keeps the global scope clean.
# Custom Aggregation Storage: The AI's method stores aggregated results in a dictionary, making it accessible for future operations if necessary

# cons
# Verbosity: The AI's solution might appear more verbose due to the function definition and extra dictionary storage.
# Direct Calculation: While both methods work efficiently, using statistics.mean() might seem more readable to some.

# AI for Part B
import pandas as pd

df = pd.DataFrame(weather_data)

# Use built-in aggregation methods to find average, max, and min temperatures
aggregated_df = df.agg(["mean", "max", "min"]).transpose()

print(aggregated_df)

#pros:
#Efficiency: The AI's method leverages pandas' built-in aggregation functions directly on the DataFrame, which are optimized for performance and conciseness.
#Conciseness: The AI’s code is more concise since it avoids looping over rows. Instead, it applies aggregation directly.
#Intuitive Structure: Keeping cities as columns can be more intuitive since each city's data is treated as a feature.

#cons:
#Direct Printing: Your method prints the aggregated results directly while iterating, which might seem more direct and less abstracted than the AI's approach.

New York {'Average': 72.66666666666667, 'Maximum': 77, 'Minimum': 68}
San Francisco {'Average': 63.166666666666664, 'Maximum': 66, 'Minimum': 60}
Chicago {'Average': 69.06666666666666, 'Maximum': 72, 'Minimum': 66}
Miami {'Average': 85.03333333333333, 'Maximum': 87, 'Minimum': 83}
                    mean   max   min
New York       72.666667  77.0  68.0
San Francisco  63.166667  66.0  60.0
Chicago        69.066667  72.0  66.0
Miami          85.033333  87.0  83.0


## Problem 4: Class Enrollment
* **Part A**: You are given a list of student names and the courses they've enrolled in. Calculate the number of students enrolled in each course.

* **Part B**: Convert the list of student-course pairs to a pandas DataFrame and solve the problem using pandas functionalities.

* **Part C**: How could the program be extended to also handle scheduling conflicts for enrolled students?

* **Part D**: (Follow the general Part D instructions)

In [56]:
# Synthetic dataset
enrollments = [("Alice", "Math"), ("Bob", "Physics"),
               ("Alice", "Physics"), ("Carol", "Math"),
               ("Dave", "Biology")]


### Part A

In [62]:
# Your answers
course_counts = {}
for _,course in enrollments:
    #print(course)
    if course in course_counts:
        course_counts[course] +=1
    else:
        course_counts [course]=1
print(course_counts)
    

{'Math': 2, 'Physics': 2, 'Biology': 1}


### Part B

In [88]:
# Your answers
import pandas as py
df = py.DataFrame(enrollments)
print(df)
df.groupby(df[1]).count()


       0        1
0  Alice     Math
1    Bob  Physics
2  Alice  Physics
3  Carol     Math
4   Dave  Biology


         0
1         
Biology  1
Math     2
Physics  2


### Part C
I would record the meeting time of every course, then group the enrollments by student and check whether any two of a student's courses meet at the same time. 
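
One possible sketch of that check in plain Python. The `course_schedule` mapping (course to day, start hour, end hour) and the helper names are assumptions for illustration; the assignment does not provide any schedule data.

```python
from collections import defaultdict
from itertools import combinations

enrollments = [("Alice", "Math"), ("Bob", "Physics"),
               ("Alice", "Physics"), ("Carol", "Math"),
               ("Dave", "Biology")]

# Hypothetical schedule: course -> (day, start_hour, end_hour)
course_schedule = {
    "Math": ("Mon", 9, 10),
    "Physics": ("Mon", 9, 11),
    "Biology": ("Tue", 13, 14),
}


def overlaps(slot_a, slot_b):
    """Two slots clash if they share a day and their time ranges intersect."""
    day_a, start_a, end_a = slot_a
    day_b, start_b, end_b = slot_b
    return day_a == day_b and start_a < end_b and start_b < end_a


def find_conflicts(enrollments, schedule):
    """Return {student: [(course_a, course_b), ...]} for clashing course pairs."""
    courses_by_student = defaultdict(list)
    for student, course in enrollments:
        courses_by_student[student].append(course)

    conflicts = {}
    for student, courses in courses_by_student.items():
        # Compare every pair of courses the student is enrolled in
        clashing = [(a, b) for a, b in combinations(courses, 2)
                    if overlaps(schedule[a], schedule[b])]
        if clashing:
            conflicts[student] = clashing
    return conflicts


print(find_conflicts(enrollments, course_schedule))
```

With this made-up schedule, only Alice has a clash, because she is the only student enrolled in both Monday morning courses.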

### Part D
Your comments here

In [72]:
# AI for part A
enrollments = [("Alice", "Math"), ("Bob", "Physics"),
               ("Alice", "Physics"), ("Carol", "Math"),
               ("Dave", "Biology")]

course_counts = {}

for _, course in enrollments:
    if course in course_counts:
        course_counts[course] += 1
    else:
        course_counts[course] = 1

print(course_counts)

#Pros
#Minimal Difference: In reality, there isn't much of a difference in the logic of the two approaches for Part A. Both solutions use a dictionary to aggregate the counts, making them equally efficient.

#Cons
#Redundancy: There isn't any distinct advantage of ChatGPT's code over yours for Part A since the solutions are nearly identical.

# AI for part B
import pandas as pd

# Convert to DataFrame
df = pd.DataFrame(enrollments, columns=["Student", "Course"])

# Group by course and count
enrollment_counts = df.groupby("Course").size()

print(enrollment_counts)

#Pros
#Column Naming: ChatGPT's approach provides column names ("Student" and "Course") during DataFrame creation, which makes the DataFrame more readable and makes the subsequent code more self-explanatory.
#Standard Alias: The use of pd as an alias for pandas is more conventional in the data science and analytics community.

#Cons 
#Explicitness: If someone isn't familiar with pandas conventions, they might find your approach of using column indices (e.g., df[1]) more explicit in indicating that operations are based on the second column of the DataFrame.

{'Math': 2, 'Physics': 2, 'Biology': 1}
Course
Biology    1
Math       2
Physics    2
dtype: int64


## Problem 5: File Search
Your task is to write a Python function that takes a directory path and a file extension as input and returns a list of all files with the given extension located within the directory or any of its sub-directories. Use Python's `os` module to accomplish this.

Below is an example file tree; however, you can use a custom-created (or already existing) filetree on your own computer. The files of interest should be 2-3 directories deep as shown below.

```
/
|-- home/
|   |-- user/
|   |   |-- documents/
|   |   |   |-- file1.txt
|   |   |   `-- file2.pdf
|   |   `-- downloads/
|   |       `-- music/
|   |           |-- song1.mp3
|   |           `-- song2.mp3
`-- var/
    `-- tmp/
        |-- file3.txt
        `-- file4.pdf

```

* **Part A**: Using only Python's built-in os module, write your function. Assume that the directory tree is represented as a list of strings following a specific structure.
* **Part B**: Perform the same operation using a different Python package (e.g., `glob`) or a different approach (e.g., recursion).
* **Part D**: Use generative AI to come up with a different approach for solving this problem, and evaluate your two solutions against the AI's. Note down any insights or advantages you discover in the process.

### Part A

In [85]:
# Your answers
import os

def scan_files(directory, extension):
    start_dir = os.path.abspath(directory)

    list_sub_files = []

    # dirpath is the folder currently being visited by os.walk
    for dirpath, dirs, files in os.walk(start_dir, topdown=True):
        for file in files:
            if file.endswith(extension):
                list_sub_files.append(os.path.join(dirpath, file))

    print(list_sub_files)
  
def main():
    directory = "/Users/chenhanlin/Documents/vandy 23 fall/AI-programming/Assignment"
    scan_files(directory, '.pdf')
    
if __name__ == "__main__":
    main()

[]


### Part B

In [84]:
# Your answers
import os
import glob

def scan_files(directory, extension):
    list_sub_files = []
    # '**' with recursive=True matches the directory itself and all sub-directories
    for filename in glob.iglob(os.path.join(directory, '**', '*' + extension), recursive=True):
        list_sub_files.append(filename)
    print(list_sub_files)
        

def main():
    directory = "/Users/chenhanlin/Documents/vandy 23 fall/AI-programming/Assignment"
    scan_files(directory, '.pdf')
    
if __name__ == "__main__":
    main()

[]


### Part D
Your comments here

In [None]:
# AI for Part A
import os

def find_files_with_extension(directory, extension):
    matching_files = []

    for dirpath, dirnames, filenames in os.walk(directory):
        for filename in filenames:
            if filename.endswith(extension):
                matching_files.append(os.path.join(dirpath, filename))

    return matching_files

#For the above example file tree, if you want to get all .txt files:
print(find_files_with_extension('/path_to_root_directory', '.txt'))

'''Pros:
Clear, straightforward logic using os.walk().
Uses an explicit loop with append, which is easy to follow (a list comprehension would be more concise for this type of operation).
Returns the list of files, offering more flexibility and allowing for potential additional processing or usage.

Cons:
Does not use os.path.abspath(directory), so if a relative path is provided, it depends on the current working directory.'''

# AI for Part B
import glob

def find_files_with_glob(directory, extension):
    # ** will search in all directories and subdirectories
    return glob.glob(f"{directory}/**/*{extension}", recursive=True)

print(find_files_with_glob('/path_to_root_directory', '.txt'))
# ['/path_to_root_directory/home/user/documents/file1.txt', '/path_to_root_directory/var/tmp/file3.txt']
'''Pros:
Efficient usage of glob.glob() to search for files with the desired extension.
Returns the list of files, offering more flexibility.

Cons:
Does not directly handle edge cases, such as if an incorrect path is provided.'''
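
The edge case noted above (an incorrect path) can be handled explicitly. The following is a sketch of a third variant, not part of either solution above, using the standard-library `pathlib` module; the function name `find_files_with_pathlib` and the throwaway demo tree are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def find_files_with_pathlib(directory, extension):
    """Return sorted paths of files under directory whose names end with extension."""
    root = Path(directory)
    if not root.is_dir():
        # Fail loudly instead of silently returning an empty list
        raise NotADirectoryError(f"Not a directory: {directory}")
    # rglob searches the directory and all of its sub-directories
    return sorted(str(p) for p in root.rglob(f"*{extension}") if p.is_file())


# Small demo on a throwaway directory tree
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "docs").mkdir()
    (Path(tmp) / "docs" / "file1.txt").write_text("hello")
    (Path(tmp) / "docs" / "file2.pdf").write_text("world")
    found = find_files_with_pathlib(tmp, ".txt")
    print([Path(p).name for p in found])
```

Raising an exception for a bad path makes the difference between "no matching files" and "wrong directory" visible to the caller.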

## Problem 6: Time-Series Data Transformation for Plotting
* **Part A**: You are given two lists: one containing timestamps and another containing corresponding temperature readings. Generate a list of (timestamp, temperature) pairs.

* **Part B**: Convert the two lists into a single pandas DataFrame and perform the same operation using pandas functionalities.

* **Part C**: Suppose you had to extend the functionality to also annotate significant events in the data, how might you incorporate this into your existing solutions?

* **Part D**: (Follow the general Part D instructions)

In [74]:
# Synthetic dataset
timestamps = [0, 3600, 7200, 4800]
temperatures = [20, 21, 19, 23]

### Part A

In [73]:
# Your answers
#del list if needed 
print(list(zip(timestamps,temperatures)))


[(0, 20), (3600, 21), (7200, 19), (4800, 23)]

### Part B

In [134]:
# Your answers
import pandas as pd
df=pd.DataFrame({'timestamps':timestamps,'temperatures':temperatures})
df.set_index('timestamps',inplace=True)
print(df)

            temperatures
timestamps              
0                     20
3600                  21
7200                  19
4800                  23


### Part C
I would add another column called significance and set it to 1 for each significant record. That lets me count how many records are significant and see when the significant events happen. 
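
A minimal sketch of the significance column, assuming a hypothetical rule that a reading counts as significant when it exceeds a threshold. The threshold value and the rule itself are made up for illustration; the assignment does not define what makes an event significant.

```python
import pandas as pd

timestamps = [0, 3600, 7200, 4800]
temperatures = [20, 21, 19, 23]
df = pd.DataFrame({"timestamps": timestamps, "temperatures": temperatures})

# Hypothetical rule: a reading above this threshold is a significant event
THRESHOLD = 22

# 1 marks a significant record, 0 marks an ordinary one
df["significance"] = (df["temperatures"] > THRESHOLD).astype(int)

# How many records are significant, and when did they happen?
significant_count = int(df["significance"].sum())
significant_times = df.loc[df["significance"] == 1, "timestamps"].tolist()

print(significant_count, significant_times)
```

Because the annotation is just another column, it travels with the data and could later be used to highlight those points on a plot.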

### Part D
Your comments here

In [75]:
# AI for part A
data_pairs = list(zip(timestamps, temperatures))
print(data_pairs)

#Pros of your approach (Part B, timestamps as the index):
#Indexing: By setting timestamps as the index, you have made time-series analysis easier. Pandas handles time-series data efficiently when timestamps are set as the index.
#Data Visualization: If you wanted to plot the data using tools that recognize pandas DataFrames (like Seaborn), your indexed DataFrame would be more immediately ready for many types of plots.

#Cons of your approach:
#Flexibility: Since you set the timestamps as the index, any further data manipulations would need to be mindful of the indexed structure, which can be restrictive for certain operations.
#Data Addition: New columns can still be assigned directly, but new data must align with the timestamp index, and using the timestamps as an ordinary column again requires reset_index().

# AI for part B
import pandas as pd
df = pd.DataFrame({'Timestamp': timestamps, 'Temperature': temperatures})
print(df)

#pros
#Simplicity: The AI's approach keeps the data in a simple tabular form, which might be more intuitive for users unfamiliar with time-series data in pandas.
#General Usability: For general tasks that don't require time-series functions, having timestamps as a regular column can sometimes be more straightforward.

#cons
#Time-Series Analysis: Without setting the timestamps as an index, leveraging some of pandas' powerful time-series functions would require an extra step.

[(0, 20), (3600, 21), (7200, 19), (4800, 23)]
   Timestamp  Temperature
0          0           20
1       3600           21
2       7200           19
3       4800           23
