# **`Giuseppe Schintu - M.1 Tasks & Answers`**

# **`M.1 Manipulate a CSV File`**


## **`exercise.M.1`** - Python Warmup

### **`Overview and Directions`**

### **`Task.1`**  - comma-separated values (.csv)

Reading and parsing [delimiter-separated values](https://en.wikipedia.org/wiki/Delimiter-separated_values) files, such as [comma-separated](https://en.wikipedia.org/wiki/Comma-separated_values) and [tab-separated values](https://en.wikipedia.org/wiki/Tab-separated_values) files, is a routine data science preprocessing activity. It is typically acceptable to request either file format for analysis activities.    
- *.csv* files store tabular data like numbers and text in a plain text format. 
- The plain text may contain letters, digits, white space, carriage returns, transliterated characters, and other artifacts.    
- Each row is one data record. Each field in a row holds a value or is empty, and commas separate the fields.    
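
As a minimal sketch of that structure, the snippet below parses a small in-memory CSV with Python's standard `csv` module. The sample rows and the `sample_text` name are made up for illustration and are not taken from the class data file.

```python
import csv
import io

# Hypothetical sample: a header row, then two records; the last field of row 2 is empty.
sample_text = "name,age,field\nAda,36,math\nBob,41,\n"

# csv.reader yields each row as a list of strings, split on the comma.
rows = list(csv.reader(io.StringIO(sample_text)))

header, records = rows[0], rows[1:]
print(header)    # ['name', 'age', 'field']
print(records)   # [['Ada', '36', 'math'], ['Bob', '41', '']]
```

Note that every field comes back as a string, including the empty one, so numeric columns still need an explicit `int()` or `float()` conversion.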

**`Tasks`**  
0. Read in the Nobel Prize winners' name and age data: [data.M.1.exercise.csv](https://github.com/cosc-526/home.page/blob/main/data.M.1.exercise.csv)  
=> the data is in the class GitHub repository. Read it however you like!  
1. Generate a single value for the total number of rows of data.
2. Generate a single value for the total number of columns of data.  
3. Calculate the laureates' average age as a float.  
4. Structure the solution as a user-defined function (`def`); doing so is optional.   
5. Hint  
.> use the `requests` library (`import requests`) to read the data from a URL  
=> mydata = requests.get(file_url)  
==> if mydata.status_code == 200:  #200 = code for a successful request  
====> do something with lines

**`Useful links`**  
- [Ch.16, Importing Data, Python.Crash.Course, Matthes](https://github.com/cosc-526/cosc.526.home.page/blob/main/textbook.Python.crash.course.matthes.pdf)  
[open](https://docs.python.org/3.6/library/functions.html#open), 
[readlines](https://docs.python.org/3.6/library/codecs.html#codecs.StreamReader.readlines), [rstrip](https://docs.python.org/3.6/library/stdtypes.html#str.rstrip), [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions), [split](https://docs.python.org/3.6/library/stdtypes.html#str.split), [slice](https://docs.python.org/3.6/glossary.html#term-slice), ["list.love"](https://docs.python.org/3.6/tutorial/datastructures.html#more-on-lists), [len](https://docs.python.org/3.6/library/functions.html#len), [int](https://docs.python.org/3.6/library/functions.html#int), [format](https://docs.python.org/3.6/library/stdtypes.html#str.format)

In [9]:
import requests
import csv

def process_nobel_data():
    # Fetch the CSV file from the GitHub link
    file_url = "https://raw.githubusercontent.com/cosc-526/home.page/main/data.M.1.exercise.csv"
    response = requests.get(file_url)
    
    if response.status_code == 200:
        # Read the CSV data and calculate required values
        data = response.text.splitlines()
        csv_reader = csv.reader(data)
        header = next(csv_reader)  # Read the header row so it is not counted as data
        
        # Task 1: Generate a single value for the total number of rows of data
        total_rows = sum(1 for _ in csv_reader)
        
        # Reset the reader back to the start
        csv_reader = csv.reader(data)
        next(csv_reader)  # Skip the header row again
        
        # Task 2: Generate a single value for the total number of columns of data
        total_columns = len(header)
        
        # Task 3: Calculate the laureates average age as a datatype float
        age_sum = 0
        for row in csv_reader:
            age = float(row[1])  # age is in the second column (index 1)
            age_sum += age
        average_age = age_sum / total_rows
        
        return total_rows, total_columns, average_age
    
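    else:
        # Raise instead of printing: an implicit None return would break
        # the three-value unpacking at the call site below
        raise RuntimeError(f"Failed to retrieve the CSV file (HTTP {response.status_code}).")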

# Call the function and store the results
rows, columns, avg_age = process_nobel_data()

# Print the results
print("Number of rows of data:", rows)
print("Number of cols:", columns)
print("Average Age:", avg_age)


Number of rows of data: 8
Number of cols: 3
Average Age: 70.875


**Task.1 Expected output**
```
Number of rows of data: 8
Number of cols: 3
Average Age: 70.875
```
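
The row count, column count, and average can also be computed in a single pass once the text is in hand, which keeps the parsing logic testable without a network call. This is only a sketch: `summarize_csv` and the sample text are hypothetical, and it assumes, as the solution above does, that the age sits in the second column.

```python
import csv

def summarize_csv(text, age_index=1):
    """Return (row_count, column_count, average_age) for CSV text with a header row."""
    reader = csv.reader(text.splitlines())
    header = next(reader)                      # first line holds the column names
    ages = [float(row[age_index]) for row in reader if row]  # skip blank lines
    average = sum(ages) / len(ages) if ages else float("nan")
    return len(ages), len(header), average

# Made-up sample data, not the class file.
sample = "name,age,year\nA,60,1950\nB,70,1960\nC,80,1970\n"
print(summarize_csv(sample))   # (3, 3, 70.0)
```

Passing `response.text` from a successful `requests.get` call to a function like this would separate fetching from parsing.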

### **`Task.3`** - Convert diacritics (ä, ö) to ASCII

- Download [data.M.1.exercise.csv](https://github.com/cosc-526/home.page/blob/main/data.M.1.exercise.csv) and right click on the file to view in Notepad.   
=> Observe the Unicode non-English letters in laureates' names like the two dots over the letter "o" in "Schrödinger."
- Learn about [Unicode](https://en.wikipedia.org/wiki/Unicode) character standards for representing different types and forms of text.  
- Grok that Python 3 [natively supports](https://docs.python.org/3/howto/unicode.html) Unicode, but many tools don't.
- Conversion of Unicode to [ASCII](https://en.wikipedia.org/wiki/ASCII) formatting is often necessary in data preprocessing.  
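
To see why the conversion is needed, here is a small sketch of what happens when a name containing a diacritic meets an ASCII-only encoder. It uses only built-in string methods (`str.isascii` requires Python 3.7 or later).

```python
name = "Schr\u00f6dinger"          # "Schrödinger", with a precomposed "ö"

print(name.isascii())              # False: "ö" is outside the ASCII range
print(len(name))                   # 11 characters
print(len(name.encode("utf-8")))   # 12 bytes: "ö" takes two bytes in UTF-8

try:
    name.encode("ascii")           # strict ASCII encoding cannot represent "ö"
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(encodable)                   # False
```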

**Tasks**
0. Read this article on diacritics conversion (e.g., "ü" → "ue"); [transliteration](https://german.stackexchange.com/questions/4992/conversion-table-for-diacritics-e-g-%C3%BC-%E2%86%92-ue).  
1. data = [data.M.1.exercise.csv](https://github.com/cosc-526/home.page/blob/main/data.M.1.exercise.csv)  
=> provided example reads directly from github
2. Analyze and run code block with a dictionary matching Unicode character "keys" to their ASCII transliteration "value."
=> as a refresher, a dictionary is defined as mydict = { key:value }
3. For labeled code sections #3.1 to 3.9, explain succinctly what the code is accomplishing and whether you are or are not familiar with it.  
4. Create your own inventory mechanism for storing this and other code blocks.  
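
The key-to-value lookup described in task 2 can be sketched on a single string. This version uses `str.maketrans` and `str.translate`, a built-in alternative to looping over `str.replace`. The two-entry `demo_map` and the name "Müller" are illustrative only and are not part of the exercise data.

```python
# Each key is one Unicode character; each value is its ASCII transliteration.
demo_map = {"\u00f6": "oe", "\u00fc": "ue"}   # "ö" -> "oe", "ü" -> "ue"

# str.maketrans accepts a dict whose keys are single characters.
table = str.maketrans(demo_map)

print("Schr\u00f6dinger".translate(table))   # Schroedinger
print("M\u00fcller".translate(table))        # Mueller
```

`translate` makes one pass over the string, whereas chained `replace` calls make one pass per dictionary entry; for a handful of names either approach is fine.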

***More useful links***
- [1: replace](https://docs.python.org/3.6/library/stdtypes.html#str.replace), [2: file object methods](https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects)  
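
As a quick refresher on the file object methods in reference 2, the sketch below writes a few lines and reads them back. The temporary directory and the file name `names.txt` are assumptions made so that the example leaves nothing behind on disk.

```python
import os
import tempfile

names = ["Paul Dirac", "Pierre Curie"]

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "names.txt")   # hypothetical file name

    with open(path, "w", encoding="utf-8") as outfile:
        for name in names:
            outfile.write(name + "\n")           # write() does not add a newline itself

    with open(path, "r", encoding="utf-8") as infile:
        read_back = [line.rstrip("\n") for line in infile]

print(read_back)   # ['Paul Dirac', 'Pierre Curie']
```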


In [10]:
import requests

translit_dict = {
    "ä" : "ae",
    "ö" : "oe",
    "ü" : "ue",
    "Ä" : "Ae",
    "Ö" : "Oe",
    "Ü" : "Ue", 
    "ł" : "l",
    "ō" : "o",
}
#3.0
#read data from a URL
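def parse_delimited_file(file_url, delimiter):
    # NOTE: `delimiter` is accepted but unused here; the split happens in #3.3
    response = requests.get(file_url)
    if response.status_code != 200:
        # Raise instead of returning None, which would crash the steps below
        raise RuntimeError('Failed to fetch the file from GitHub.')
    # splitlines() handles both \n and \r\n line endings
    lines = [line for line in response.text.splitlines() if line.strip()]  # Skip empty lines
    return lines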

file_url = "https://raw.githubusercontent.com/cosc-526/home.page/main/data.M.1.exercise.csv"
lines = parse_delimited_file(file_url, delimiter=",")

#3.1
#with open("data.exercise.M.1.csv", 'r', encoding='utf8') as csvfile:
#    lines = csvfile.readlines()
#3.2
# Strip off the newline from the end of each line
lines = [line.rstrip() for line in lines]

#3.3   
# Split each line based on the delimiter (which, in this case, is the comma)
split_lines = [line.split(",") for line in lines]

#3.4
# Separate the header from the data
header = split_lines[0]
data_lines = split_lines[1:]
    
#3.5    
# Find "name" within the header
name_index = header.index("name")

#3.6
# Extract the names from the rows
unicode_names = [line[name_index] for line in data_lines]

#3.7
# Iterate over the names
translit_names = []
for unicode_name in unicode_names:
    # Perform the replacements in the translit_dict
    # HINT: ref [1]
    translit_name = unicode_name
    for key, value in translit_dict.items():
        translit_name = translit_name.replace(key, value)
    translit_names.append(translit_name)

#3.8
# Write out the names to a file named "data.exercise.M.1.ascii.txt"
# HINT: ref [2]
# An explicit encoding avoids platform-dependent defaults if a character
# not covered by translit_dict slips through
with open("data.exercise.M.1.ascii.txt", 'w', encoding='utf8') as outfile:
    for name in translit_names:
        outfile.write(name + "\n")
#3.9
# Verify that the names were converted and written out correctly
with open("data.exercise.M.1.ascii.txt", 'r', encoding='utf8') as infile:
    for line in infile:
        print(line.rstrip())


Richard Phillips Feynman
Shin'ichiro Tomonaga
Julian Schwinger
Rudolf Ludwig Moessbauer
Erwin Schroedinger
Paul Dirac
Maria Sklodowska-Curie
Pierre Curie


In [None]:
#=>Enter answer/reflection   
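#3.1 (commented out) Open a local copy of the CSV as UTF-8 and read its lines into a list; replaced by the URL read in #3.0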
#3.2 populate lines object while stripping off the newline from the end of each line
#3.3 Split each line based on the delimiter (which, in this case, is the comma)
#3.4 Separate the header from the data
#3.5 Find "name" within the header
#3.6 Extract the names from the rows
#3.7 Iterate over the names and replace character matches with translit_dict key/value.
#3.8 Write out the names to a file named "data.exercise.M.1.ascii.txt"
#3.9 Verify that the names were converted and written out correctly


**`Expected output`**
```
Richard Phillips Feynman
Shin'ichiro Tomonaga
Julian Schwinger
Rudolf Ludwig Moessbauer
Erwin Schroedinger
Paul Dirac
Maria Sklodowska-Curie
Pierre Curie
```
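
For comparison, a general-purpose fallback is Unicode decomposition with the standard `unicodedata` module. This is a different technique from the dictionary above, not a replacement for it: it strips accents rather than transliterating them, and it silently drops characters that have no decomposition. The helper name `strip_diacritics` is hypothetical.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose with NFKD, then discard everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_diacritics("Schr\u00f6dinger"))   # Schrodinger (not "Schroedinger")
print(strip_diacritics("Sk\u0142odowska"))    # Skodowska: "ł" has no decomposition and is dropped
```

The second result shows why an explicit mapping such as `translit_dict` is still useful: it controls exactly what each character becomes.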

