#  CS 594 / CS 690 - Assignment 01
### August 27, 2018
---

For this assignment, you must work in groups of one or two students. Each person is responsible for writing their own code, but the group will discuss their solution together.  In this notebook, we provide you with basic functions for completing the assignment.  *You will need to modify existing code and write new code to find a solution*.  Each member of the group must upload their own work to GitHub (which we will cover in the next lecture).

# Problem 1
In this problem we will explore reading in and parsing [delimiter-separated values](https://en.wikipedia.org/wiki/Delimiter-separated_values) stored in files.  We will start with [comma-separated values](https://en.wikipedia.org/wiki/Comma-separated_values) and then move on to [tab-separated values](https://en.wikipedia.org/wiki/Tab-separated_values).

### Problem 1a: Comma-Separated Values (CSV)

From [Wikipedia](https://en.wikipedia.org/wiki/Comma-separated_values): In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.
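
To make the record/field structure concrete, here is a minimal sketch that splits a small in-memory CSV string into a header and data records. The sample text and the names `sample`, `header`, and `records` are invented for illustration and are not the contents of the provided `data.csv`.

```python
# Hypothetical sample data, not the contents of the assignment's data.csv.
sample = "name,age,city\nAda,36,London\nAlan,41,Wilmslow\n"

# Each line is one record; each comma separates two neighboring fields.
rows = [line.split(",") for line in sample.splitlines()]
header, records = rows[0], rows[1:]

print(header)        # ['name', 'age', 'city']
print(len(records))  # 2
print(len(header))   # 3
```

Splitting on every comma is only safe when no field contains a comma itself; quoted fields such as `"Curie, Marie"` would need a real CSV parser.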

If you were to consider the CSV file as a matrix, each line would represent a row and each comma would mark the boundary between two columns.  In the provided CSV file, the first row consists of a header that "names" each column.  In this problem:

- Count (and print) the number of rows of data (header is excluded) in the CSV file
- Count (and print) the number of columns of data in the CSV file
- Calculate (and print) the average of the values that are in the "age" column
  - You can assume each age in the file is an integer, but the average should be calculated as a float

In [2]:
def parse_delimited_file(filename, delimeter = ","):
    # Read every line of the file, then drop the trailing newlines.
    with open(filename, 'r', encoding='utf8') as csvfile:
        lines = csvfile.readlines()
    data = [x.rstrip('\n') for x in lines]

    # Split each line into fields on the requested delimiter.
    data1 = [line.split(delimeter) for line in data]
    header = data1[0]
    body = data1[1:]

    # Locate the "age" column by name rather than by position.
    index = header.index('age')

    num_data_rows = len(body)
    num_data_cols = len(header)

    # Sum the integer ages; "/" yields a float average in Python 3.
    total = 0
    for y in body:
        total = total + int(y[index])
    avg = total / num_data_rows

    print("Number of rows of data: {}".format(num_data_rows))
    print("Number of cols: {}".format(num_data_cols))
    print("Average Age: {}".format(avg))

parse_delimited_file('data.csv')

Number of rows of data: 8
Number of cols: 3
Average Age: 70.875


**Expected Output:**
```
Number of rows of data: 8
Number of cols: 3
Average Age: 70.875
```
**References:**
- [1: open](https://docs.python.org/3.6/library/functions.html#open)
- [2: readlines](https://docs.python.org/3.6/library/codecs.html#codecs.StreamReader.readlines)
- [3: rstrip](https://docs.python.org/3.6/library/stdtypes.html#str.rstrip)
- [4: list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)
- [5: split](https://docs.python.org/3.6/library/stdtypes.html#str.split)
- [6: slice](https://docs.python.org/3.6/glossary.html#term-slice)
- [7: "more on lists"](https://docs.python.org/3.6/tutorial/datastructures.html#more-on-lists)
- [8: len](https://docs.python.org/3.6/library/functions.html#len)
- [9: int](https://docs.python.org/3.6/library/functions.html#int)
- [10: format](https://docs.python.org/3.6/library/stdtypes.html#str.format)


### Problem 1b: Tab-Separated Values (TSV)

From [Wikipedia](https://en.wikipedia.org/wiki/Tab-separated_values): A tab-separated values (TSV) file is a simple text format for storing data in a tabular structure, e.g., database table or spreadsheet data, and a way of exchanging information between databases. Each record in the table is one line of the text file. Each field value of a record is separated from the next by a tab character. The TSV format is thus a type of the more general delimiter-separated values format.

In this problem, repeat the analyses performed in the previous problem, but for the provided tab-delimited file.

**NOTE:** the order of the columns has changed in this file.  If you hardcoded the position of the "age" column, think about how you can generalize the `parse_delimited_file` function to work for any delimited file with an "age" column.
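
One way to avoid depending on column order is to look fields up by header name. The sketch below uses the standard library's `csv.DictReader` for that; the sample data and the names `sample_tsv`, `ages`, and `average_age` are assumptions made for illustration, not part of the provided files.

```python
import csv
import io

# Hypothetical tab-separated sample with "age" as the last column.
sample_tsv = "name\tcity\tage\nAda\tLondon\t36\nAlan\tWilmslow\t41\n"

# DictReader uses the first row as field names, so each record can be
# indexed by column name regardless of where that column sits.
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
ages = [int(row["age"]) for row in reader]

average_age = sum(ages) / len(ages)
print(average_age)  # 38.5
```

The same lookup-by-name idea works with plain lists by calling `header.index("age")` once and reusing the result for every row.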

In [19]:
def parse_delimited_file(filename, delimeter):
    # Read every line of the file, then drop the trailing newlines.
    with open(filename, 'r', encoding='utf8') as svfile:
        lines = svfile.readlines()
    data = [x.rstrip('\n') for x in lines]

    # Split each line into fields on the requested delimiter.
    data1 = [line.split(delimeter) for line in data]
    header = data1[0]
    body = data1[1:]

    # Locate the "age" column by name so column order does not matter.
    index = header.index('age')

    num_data_rows = len(body)
    num_data_cols = len(header)

    # Sum the integer ages; "/" yields a float average in Python 3.
    total = 0
    for y in body:
        total = total + int(y[index])
    avg = total / num_data_rows

    print("Number of rows of data: {}".format(num_data_rows))
    print("Number of cols: {}".format(num_data_cols))
    print("Average Age: {}".format(avg))

parse_delimited_file('data.tsv', delimeter = '\t')

Number of rows of data: 8
Number of cols: 3
Average Age: 70.875


**Expected Output:**
```
Number of rows of data: 8
Number of cols: 3
Average Age: 70.875
```
---

# Problem 2
If you opened the `data.csv` file, you may have noticed some non-English letters in the name column.  These characters are represented using [Unicode](https://en.wikipedia.org/wiki/Unicode), a standard for representing many different types and forms of text.  Python 3 [natively supports](https://docs.python.org/3/howto/unicode.html) Unicode, but many tools do not.  Some tools require text to be formatted with [ASCII](https://en.wikipedia.org/wiki/ASCII).
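
As a small illustration of the difference, the sketch below checks whether a string can be encoded as ASCII. The helper name `is_ascii` is hypothetical, and the example strings are chosen for illustration only.

```python
def is_ascii(text):
    """Return True if every character in text can be encoded as ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        # At least one character falls outside the 128 ASCII code points.
        return False
    return True

print(is_ascii("Paul Dirac"))         # True
print(is_ascii("Erwin Schrödinger"))  # False
```

A tool that accepts only ASCII input would fail on the second string in much the same way that `encode("ascii")` does.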

Convert the Unicode-formatted names into ASCII-formatted names, and save the names out to a file named `data-ascii.txt` (one name per line).  We have provided you with a [transliteration dictionary](https://german.stackexchange.com/questions/4992/conversion-table-for-diacritics-e-g-%C3%BC-%E2%86%92-ue) that maps several common Unicode characters to their ASCII transliteration.  Use this dictionary to convert the Unicode strings to ASCII.

In [3]:
import numpy as np

translit_dict = {
    "ä" : "ae",
    "ö" : "oe",
    "ü" : "ue",
    "Ä" : "Ae",
    "Ö" : "Oe",
    "Ü" : "Ue", 
    "ł" : "l",
    "ō" : "o",
}

# Read the CSV file and split it into a header row and data rows.
with open("data.csv", 'r', encoding='utf8') as csvfile:
    lines = csvfile.readlines()

data = [x.rstrip('\n') for x in lines]
data1 = [line.split(',') for line in data]
header = data1[0]
body = data1[1:]

# Select the "name" column by its header instead of a hardcoded position.
index = header.index('name')
data2 = np.array(body)
unicode_names = data2[:, index]

def replace(name):
    # Apply every transliteration in translit_dict to the name.
    for key in translit_dict:
        name = name.replace(key, translit_dict[key])
    return name

translit_names = []
for unicode_name in unicode_names:
    translit_names.append(replace(str(unicode_name)))

# Write one transliterated name per line.
with open("data-ascii.txt", 'w', encoding='utf8') as newfile:
    for name in translit_names:
        newfile.write(name + '\n')

# Read the file back to check the result.
with open("data-ascii.txt", 'r', encoding='utf8') as infile:
    for line in infile:
        print(line.rstrip())

Richard Phillips Feynman
Shin'ichiro Tomonaga
Julian Schwinger
Rudolf Ludwig Moessbauer
Erwin Schroedinger
Paul Dirac
Maria Sklodowska-Curie
Pierre Curie


**Expected Output:**
```
Richard Phillips Feynman
Shin'ichiro Tomonaga
Julian Schwinger
Rudolf Ludwig Moessbauer
Erwin Schroedinger
Paul Dirac
Maria Sklodowska-Curie
Pierre Curie
```

**References:**
- [1: replace](https://docs.python.org/3.6/library/stdtypes.html#str.replace)
- [2: file object methods](https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects)
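
As an alternative to calling `str.replace` once per dictionary key, the same mapping can be applied in a single pass with `str.maketrans` and `str.translate`. This is only a sketch of that alternative, not the required approach; the helper name `transliterate` is hypothetical, and the dictionary is repeated so the block runs on its own.

```python
# Same mapping as the translit_dict provided in the assignment.
translit_dict = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ł": "l", "ō": "o",
}

# str.maketrans accepts a dict whose keys are single characters and
# whose values are replacement strings of any length.
table = str.maketrans(translit_dict)

def transliterate(name):
    """Replace every character found in translit_dict in one pass."""
    return name.translate(table)

print(transliterate("Erwin Schrödinger"))       # Erwin Schroedinger
print(transliterate("Maria Skłodowska-Curie"))  # Maria Sklodowska-Curie
```

Characters that are not keys in the dictionary pass through unchanged, so a name containing an unmapped non-ASCII character would still not be pure ASCII afterwards.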

# Free-Form Questions:
**Answer the following questions, in a couple sentences each, in the cells provided below.**

- Your solutions for Problems 1 & 2 probably share a lot of code in common.  You might even have copied-and-pasted from Problem 1 into Problem 2.  How would you refactor `parse_delimited_file` to be useful in both problems?
- Are there any pre-built Python packages that could help you solve these problems? How could you use them?
- List the key tasks you accomplished during this assignment.
- Describe the challenges you faced in addressing these tasks and how you overcame these challenges.
- Did you work with other students on this assignment? If yes, how did you help them? How did they help you? Be as specific as possible.

- I used `parse_delimited_file` only in the first problem, where the function reads and processes the `data.csv` file. In the second problem, I read the CSV file directly without calling the function.


- Yes. The `numpy` package can be used to solve these problems. In fact, in the second problem I imported `numpy`, which helped me build an array from the data provided.


- The key tasks that I accomplished in this assignment are:
  - Tasks common to both problems:
    - opened the CSV file in read mode
    - stripped the newline at the end of each line
    - split each line on the delimiter `,`
    - separated the header from the data
    - found the column index for `age` or `name`
  - For Problem 1:
    - calculated the number of rows and columns using `len`
    - calculated the average age using a `for` loop
  - For Problem 2:
    - imported `numpy` and used it to make an array of the data
    - replaced the Unicode-formatted names with ASCII-formatted names using a `replace` function and the `translit_dict`
    - saved the names in a text file named `data-ascii.txt`
  
  
- As a beginner in Python, I faced challenges in every aspect of solving these two problems. I am still learning, and I hope to overcome these difficulties in the near future.
  
  
- I have not worked with any other students yet, but I am eagerly looking forward to working with others.
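
For the refactoring question above, one possible direction is to separate reading the file from analyzing it, so that Problems 1 and 2 can share the reading step. The sketch below assumes two hypothetical helpers, `read_delimited` and `column`, and uses the standard library's `csv` module; it runs against a throwaway file with invented contents rather than the provided data.

```python
import csv
import os
import tempfile

def read_delimited(filename, delimiter=","):
    """Return (header, rows) for a delimited text file."""
    with open(filename, "r", encoding="utf8", newline="") as infile:
        rows = list(csv.reader(infile, delimiter=delimiter))
    return rows[0], rows[1:]

def column(header, rows, name):
    """Return every value in the column called `name`."""
    index = header.index(name)
    return [row[index] for row in rows]

# Demo on a temporary file with hypothetical contents.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.tsv")
    with open(path, "w", encoding="utf8", newline="") as outfile:
        outfile.write("name\tage\nAda\t36\nAlan\t41\n")
    header, rows = read_delimited(path, delimiter="\t")

ages = [int(age) for age in column(header, rows, "age")]
names = column(header, rows, "name")

print(len(rows), len(header), sum(ages) / len(ages))  # 2 2 38.5
print(names)  # ['Ada', 'Alan']
```

With this split, Problem 1 would only need the `age` column and Problem 2 only the `name` column, and neither would repeat the file-handling code.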