# COSC 526 - Assignment 01
### January 29, 2021
---

In this notebook, we provide you with basic functions for completing the assignment.  *You will need to modify existing code and write new code to find a solution*.  Each member of the group must upload their own work to their personal GitHub repository, which we set up during the first class.

# Practical Tasks:

This set of practical tasks is to be completed during the first class.

**Definitions:**
- **GitHub:** web-based hosting service for version control used to distribute and collect assignments as well as other class materials (e.g., slides, code, and datasets)
- **Git:** distributed version-control software; GitHub hosts Git repositories

**Practical Tasks:** 
- Create your own GitHub account
- Submit your GitHub username to the Google form: https://forms.gle/CKugke8Dzqjm9tQ89
- Install Git on your laptop

**This Assignment is due (pushed to your personal class GitHub repository) at the start of the second class.**

# Problem 1

In this problem we explore reading in and parsing [delimiter-separated values](https://en.wikipedia.org/wiki/Delimiter-separated_values) stored in files.  We start with [comma-separated values](https://en.wikipedia.org/wiki/Comma-separated_values) and then move on to [tab-separated values](https://en.wikipedia.org/wiki/Tab-separated_values).

### Problem 1a: Comma-Separated Values (CSV)

From [Wikipedia](https://en.wikipedia.org/wiki/Comma-separated_values): In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.

If you were to consider the CSV file as a matrix, each line would represent a row and each comma would separate one column from the next.  In the provided CSV file, the first row consists of a header that "names" each column.  In this problem:

- Count (and print) the number of rows of data (header is excluded) in the csv file
- Count (and print) the number of columns of data in the csv file
- Calculate (and print) the average of the values that are in the "age" column
  - You can assume each age in the file is an integer, but the average should be calculated as a float

In [3]:
def parse_delimited_file(filename, delimiter=","):
    # Open and read in all lines of the file
    # (I do not recommend readlines for LARGE files)
    # `open`: ref [1]
    # `readlines`: ref [2]
    with open(filename, 'r', encoding='utf8') as dsvfile:
        lines = dsvfile.readlines()

    # Strip off the newline from the end of each line
    # Using list comprehension is the recommended pythonic way to iterate through lists
    # HINT: refs [3,4]

    List = [line.rstrip('\n') for line in lines]  # list comprehension

    # Split each line based on the delimiter (which, by default, is the comma)
    # HINT: ref [5]
    ListSplit = [line.split(delimiter) for line in List]

    # Separate the header from the data
    # HINT: ref [6]
    ListSplitHeader = ListSplit[0]
    ListSplitData = ListSplit[1:]  # every row after the header



    
    # Find "age" within the header
    # (i.e., calculating the column index for "age")
    # HINT: ref [7]
    x = ListSplitHeader.index('age')


    # Calculate the number of data rows and columns
    # HINT: [8]

    num_data_rows = len(ListSplitData)
    num_data_cols = len(ListSplitHeader)
    
    # Sum the "age" values
    # HINT: ref [9]
    age = [int(entry[x]) for entry in ListSplitData]
    totalAge = sum(age)
        
        
    # Calculate the average age
    ave_age = totalAge/num_data_rows
    
    # Print the results
    # `format`: ref [10]
    print("Number of rows of data: {}".format(num_data_rows))
    print("Number of cols: {}".format(num_data_cols))
    print("Average Age: {}".format(ave_age))
    
# Parse the provided csv file
parse_delimited_file('data.csv')

Number of rows of data: 8
Number of cols: 3
Average Age: 70.875


**Expected Output:**
```
Number of rows of data: 8
Number of cols: 3
Average Age: 70.875
```
**References:**
- [1: open](https://docs.python.org/3.6/library/functions.html#open)
- [2: readlines](https://docs.python.org/3.6/library/codecs.html#codecs.StreamReader.readlines)
- [3: list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)
- [4: rstrip](https://docs.python.org/3.6/library/stdtypes.html#str.rstrip)
- [5: split](https://docs.python.org/3.6/library/stdtypes.html#str.split)
- [6: slice](https://docs.python.org/3.6/glossary.html#term-slice)
- [7: "more on lists"](https://docs.python.org/3.6/tutorial/datastructures.html#more-on-lists)
- [8: len](https://docs.python.org/3.6/library/functions.html#len)
- [9: int](https://docs.python.org/3.6/library/functions.html#int)
- [10: format](https://docs.python.org/3.6/library/stdtypes.html#str.format)
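
The comment in the solution warns against `readlines` for large files. A minimal sketch of the alternative, assuming the same header-plus-rows layout, iterates over the file one line at a time and keeps only a running total. The function name `average_column` and the sample data are hypothetical and are not part of the assignment.

```python
import io

def average_column(lines, column, delimiter=","):
    """Average an integer column while reading one line at a time."""
    lines = iter(lines)
    # The first line is the header; use it to locate the column
    header = next(lines).rstrip("\n").split(delimiter)
    idx = header.index(column)
    total = 0
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:  # skip blank lines, e.g. a trailing newline
            continue
        total += int(line.split(delimiter)[idx])
        count += 1
    return total / count if count else float("nan")

# Hypothetical in-memory data standing in for a file on disk
sample = io.StringIO("name,age\nAda,36\nGrace,85\n")
print(average_column(sample, "age"))
```

An open file object can be passed in place of `sample`, since file objects also yield one line per iteration.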


### Problem 1b: Tab-Separated Values (TSV)

From [Wikipedia](https://en.wikipedia.org/wiki/Tab-separated_values): A tab-separated values (TSV) file is a simple text format for storing data in a tabular structure, e.g., database table or spreadsheet data, and a way of exchanging information between databases. Each record in the table is one line of the text file. Each field value of a record is separated from the next by a tab character. The TSV format is thus a type of the more general delimiter-separated values format.

In this problem, repeat the analyses performed in the previous problem, but for the provided tab-delimited file.

**NOTE:** the order of the columns has changed in this file.  If you hardcoded the position of the "age" column, think about how you can generalize the `parse_delimited_file` function to work for any delimited file with an "age" column.
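
One way to see why looking up the column by name generalizes is the sketch below. It assumes two small made-up samples (not the provided data files) that hold the same records with different delimiters and column orders; `column_values` is a hypothetical helper.

```python
def column_values(text, column, delimiter):
    """Return the values under `column`, wherever that column sits."""
    rows = [line.split(delimiter) for line in text.splitlines() if line]
    header, data = rows[0], rows[1:]
    idx = header.index(column)  # position comes from the header, not a constant
    return [row[idx] for row in data]

# Hypothetical samples: same records, different column order and delimiter
csv_text = "name,age\nAda,36\nGrace,85\n"
tsv_text = "age\tname\n36\tAda\n85\tGrace\n"

print(column_values(csv_text, "age", ","))
print(column_values(tsv_text, "age", "\t"))
```

Both calls return the same list of ages even though "age" is the second column in one sample and the first in the other.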

In [100]:
# Further reading on optional arguments, like "delimiter": http://www.diveintopython.net/power_of_introspection/optional_arguments.html
parse_delimited_file('data.tsv', delimiter="\t")

Number of rows of data: 8
Number of cols: 3
Average Age: 70.875


**Expected Output:**
```
Number of rows of data: 8
Number of cols: 3
Average Age: 70.875
```
---

# Problem 2

If you opened the `data.csv` file, you may have noticed some non-English letters in the names column.  These characters are represented using [Unicode](https://en.wikipedia.org/wiki/Unicode), a standard for representing many different types and forms of text.  Python 3 [natively supports](https://docs.python.org/3/howto/unicode.html) Unicode, but many tools do not.  Some tools require text to be formatted with [ASCII](https://en.wikipedia.org/wiki/ASCII).

Convert the Unicode names into ASCII names, and save them to a file named `data-ascii.txt` (one name per line).  We have provided you with a [transliteration dictionary](https://german.stackexchange.com/questions/4992/conversion-table-for-diacritics-e-g-%C3%BC-%E2%86%92-ue) that maps several common Unicode characters to their ASCII transliterations.  Use this dictionary to convert the Unicode strings to ASCII.

In [17]:
translit_dict = {
    "ä" : "ae",
    "ö" : "oe",
    "ü" : "ue",
    "Ä" : "Ae",
    "Ö" : "Oe",
    "Ü" : "Ue", 
    "ł" : "l",
    "ō" : "o",
}

with open("data.csv", 'r', encoding='utf8') as csvfile:
    lines = csvfile.readlines()

delimiter=","

# Strip off the newline from the end of each line
List = [line.rstrip('\n') for line in lines]  # list comprehension

# Split each line based on the delimiter (which, in this case, is the comma)
ListSplit = [line.split(delimiter) for line in List]

# Separate the header from the data
ListSplitHeader = ListSplit[0]
ListSplitData = ListSplit[1:]  # every row after the header

    
# Find "name" within the header
x = ListSplitHeader.index('name')

# Extract the names from the rows
unicode_names = [entry[x] for entry in ListSplitData]
# print(unicode_names)

# Iterate over the names
translit_names = []

for unicode_name in unicode_names:
    # Perform the replacements in the translit_dict
    # HINT: ref [1]
    for letter in translit_dict:
        # replace() leaves the string unchanged when the letter is absent
        unicode_name = unicode_name.replace(letter, translit_dict[letter])
    translit_names.append(unicode_name)

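# Write out the names to a file named "data-ascii.txt"
# HINT: ref [2]
# encoding='ascii' makes the write fail loudly if a non-ASCII character remains
with open('data-ascii.txt', 'w', encoding='ascii') as ascii_file:
    for name in translit_names:
        ascii_file.write(name + '\n')
# No explicit close() is needed: the with block closes the file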


# Verify that the names were converted and written out correctly
with open("data-ascii.txt", 'r') as infile:
    for line in infile:
        print(line.rstrip())

Richard Phillips Feynman
Shin'ichiro Tomonaga
Julian Schwinger
Rudolf Ludwig Moessbauer
Erwin Schroedinger
Paul Dirac
Maria Sklodowska-Curie
Pierre Curie


**Expected Output:**
```
Richard Phillips Feynman
Shin'ichiro Tomonaga
Julian Schwinger
Rudolf Ludwig Moessbauer
Erwin Schroedinger
Paul Dirac
Maria Sklodowska-Curie
Pierre Curie
```

**References:**
- [1: replace](https://docs.python.org/3.6/library/stdtypes.html#str.replace)
- [2: file object methods](https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects)
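
As an alternative to chaining `replace` calls, the built-in `str.maketrans` and `str.translate` can apply every mapping in a single pass. The sketch below assumes a small subset of the transliteration dictionary. The helper `to_ascii` is hypothetical, and the `encode("ascii")` call is there only to check that nothing outside ASCII survived.

```python
# Small subset of a transliteration table, for illustration only
translit_dict = {"ö": "oe", "ü": "ue", "ł": "l"}

# str.maketrans accepts a dict whose keys are single characters;
# the values may be strings of any length
table = str.maketrans(translit_dict)

def to_ascii(name):
    """Transliterate `name` and confirm the result is pure ASCII."""
    converted = name.translate(table)
    converted.encode("ascii")  # raises UnicodeEncodeError if a character was missed
    return converted

print(to_ascii("Erwin Schrödinger"))
print(to_ascii("Maria Skłodowska-Curie"))
```

A name containing a character that is missing from the table raises `UnicodeEncodeError` instead of being written to the output file unconverted.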

# Free-Form Questions:

Q1. Your solutions for Problems 1 & 2 probably share a lot of code. You might even have copied and pasted from Problem 1 into Problem 2. Refactor `parse_delimited_file` to be useful in both problems.

In [41]:
# Add here your code 
# The function defined for Q1a can be used for Q1b too.


def parse_delimited_file_refactored(filename, delimiter): #delimiter can be given as an input
    # Open and read in all lines of the file
    # (I do not recommend readlines for LARGE files)
    # `open`: ref [1]
    # `readlines`: ref [2]
    with open(filename, 'r', encoding='utf8') as dsvfile:
        lines = dsvfile.readlines()

    # Strip off the newline from the end of each line
    # Using list comprehension is the recommended pythonic way to iterate through lists
    # HINT: refs [3,4]

    List = [line.rstrip('\n') for line in lines]  # list comprehension

    # Split each line based on the delimiter
    # HINT: ref [5]
    ListSplit = [line.split(delimiter) for line in List]

    # Separate the header from the data
    # HINT: ref [6]
    ListSplitHeader = ListSplit[0]  # assuming the header is always at the top
    ListSplitData = ListSplit[1:]  # every row after the header



    
    # Find "age" within the header
    # (i.e., calculating the column index for "age")
    # HINT: ref [7]
    x = ListSplitHeader.index('age')


    # Calculate the number of data rows and columns
    # HINT: [8]

    num_data_rows = len(ListSplitData)
    num_data_cols = len(ListSplitHeader)
    
    # Sum the "age" values
    # HINT: ref [9]
    age = [int(entry[x]) for entry in ListSplitData]
    totalAge = sum(age)
        
        
    # Calculate the average age
    ave_age = totalAge/num_data_rows
    
    # Print the results
    # `format`: ref [10]
    print("Number of rows of data: {}".format(num_data_rows))
    print("Number of cols: {}".format(num_data_cols))
    print("Average Age: {}".format(ave_age))


#Examples
print('Opening CSV File\n')
parse_delimited_file_refactored('data.csv',',')
print('\n\n')
print('Opening TSV File\n')
parse_delimited_file_refactored('data.tsv','\t')

Opening CSV File

Number of rows of data: 8
Number of cols: 3
Average Age: 70.875



Opening TSV File

Number of rows of data: 8
Number of cols: 3
Average Age: 70.875
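
One possible shape for a refactor that serves both problems is to separate reading from analysis: one helper returns the header and rows, and another pulls out a column by name. This is a sketch under that assumption, not the assignment's reference solution. The names `read_delimited` and `get_column` and the sample rows are hypothetical.

```python
def read_delimited(lines, delimiter=","):
    """Split an iterable of lines (e.g. an open file) into (header, rows)."""
    table = [line.rstrip("\n").split(delimiter) for line in lines if line.strip()]
    return table[0], table[1:]

def get_column(header, rows, name):
    """Return every value stored under the column called `name`."""
    idx = header.index(name)
    return [row[idx] for row in rows]

# Hypothetical sample standing in for the lines of a data file
sample = ["name,age\n", "Ada,36\n", "Grace,85\n"]
header, rows = read_delimited(sample)

# Problem 1 style: row count, column count, average age
ages = [int(age) for age in get_column(header, rows, "age")]
print(len(rows), len(header), sum(ages) / len(ages))

# Problem 2 style: the names, ready for transliteration
print(get_column(header, rows, "name"))
```

With this split, Problem 1 needs only the "age" column and Problem 2 only the "name" column, and neither repeats the file-reading code.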


Q2. Are there any pre-built Python packages that could help you solve these problems? If yes, refactor your solutions to use those packages. 

In [46]:
# Add here your code 
import csv
with open('data.csv', encoding="utf8", newline='') as csv_file:
    data_csv = csv.reader(csv_file, delimiter=',')

    List = [row for row in data_csv]
    ListSplitHeader = List[0]
    ListSplitData = List[1:]  # rows after the header, taken from this reader's output
    # Find "age" within the header
    # (i.e., calculating the column index for "age")
    # HINT: ref [7]
    x = ListSplitHeader.index('age')


    # Calculate the number of data rows and columns
    # HINT: [8]

    num_data_rows = len(ListSplitData)
    num_data_cols = len(ListSplitHeader)
    
    # Sum the "age" values
    # HINT: ref [9]
    age = [int(entry[x]) for entry in ListSplitData]
    totalAge = sum(age)
        
        
    # Calculate the average age
    ave_age = totalAge/num_data_rows
    
    # Print the results
    # `format`: ref [10]
    print("Number of rows of data: {}".format(num_data_rows))
    print("Number of cols: {}".format(num_data_cols))
    print("Average Age: {}".format(ave_age))

Number of rows of data: 8
Number of cols: 3
Average Age: 70.875
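
The standard-library `csv` module also reads tab-separated data when given `delimiter="\t"`, and `csv.DictReader` exposes each row keyed by its header name. The sketch below assumes a small in-memory sample in place of `data.tsv`.

```python
import csv
import io

# Hypothetical sample standing in for a tab-separated file
tsv_text = "age\tname\n36\tAda\n85\tGrace\n"

# DictReader uses the first row as field names, so no index lookup is needed
reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
records = list(reader)

ages = [int(record["age"]) for record in records]
print("Number of rows of data: {}".format(len(records)))
print("Number of cols: {}".format(len(reader.fieldnames)))
print("Average Age: {}".format(sum(ages) / len(ages)))
```

When reading from a real file, the `csv` documentation recommends opening it with `newline=''`.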


Q3. Tell us about your experience (for each point below provide a couple of sentences).
- Describe the challenges you faced in addressing these tasks and how you overcame these challenges.
- Did you work with other students on this assignment? If yes, how did you help them? How did they help you?

Using the transliteration dictionary was a challenge; I worked through the Python documentation to resolve it.
I was not aware of a dedicated TSV reader in Python.

# Live Chat: The History of Big Data

In her Supercomputing 2013 keynote, Intel's Genevieve Bell shows that we have been dealing with big data for millennia, and that approaching big data problems with the right frame of reference is the key to addressing many of the problems we face today: https://youtu.be/CNoi-XqwJnA

List three key concepts you learned by watching the video.

- By the time a data collection event is finished, the data might already be outdated.
- In the example of the media man in India, a single data point is itself a set of nested data points.
- The data collected depend on the circumstances in which they were collected.
- Visualization of data is important; for example, in the case of cholera, it was important to collect data on where the deaths occurred.

# Live Chat: What we learned from 5 million books!

### Live Chat:
Jean-Baptiste Michel and Erez Lieberman Aiden tell us about “What we learned from 5 million books”
https://www.ted.com/talks/jean_baptiste_michel_erez_lieberman_aiden_what_we_learned_from_5_million_books

Answer these questions related to the talk:
- What is the take-away of this talk? Summarize it in up to 3 sentences.
- What are metadata?
- What is a n-gram?
- What is the suppression index? 
- What is culturomics? 

- The talk was about the digitization of books. There are data and metadata for these books. Scanning through these books can help us study the culture of the people who lived in a particular era.

- Metadata are data that describe other data; in this case, the publication date, place, and so on of each book.

- An n-gram is a contiguous sequence of n items (for example, words) taken from a text.

- The suppression index is the ratio of the observed number of occurrences of a certain word or name to the expected number of occurrences of that word or name.

- Culturomics is the application of massive-scale data collection and analysis to the study of human culture.
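
The n-gram definition can be illustrated with a short sketch that slides a window of length n over a list of words. The function name `ngrams` and the sample sentence are illustrative only.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "what we learned from five million books".split()
print(ngrams(words, 1)[:2])  # 1-grams are single words
print(ngrams(words, 2)[:2])  # 2-grams are pairs of adjacent words
```

A sequence shorter than n yields an empty list, because no window of that length fits.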

# Reading Assignment: MapReduce: Simplified Data Processing on Large Clusters

Use the three-pass approach to read the paper: Jeffrey Dean and Sanjay Ghemawat (2004) MapReduce: Simplified Data Processing on Large Clusters.