# Lecture 19

## Reading Data



In [None]:
!ls 

In [None]:
!cat Scores.csv

## Comma Separated Values (CSV) File Format

One of the simplest and most common file formats for storing tabular data is Comma Separated Values (CSV). A CSV file generally represents a table. The top row (the first line of the file) holds the column labels, separated by commas, and each column stores a different feature or field. For student data, for example, the first column could be the name, the second the ID, the third the major, and so on. After the first line, each row holds the data for one data point or example. In the case of student data, each row could correspond to one student.
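
To make the layout concrete, here is a minimal sketch that uses a small made-up table held in a string. The names, IDs, and majors are invented for illustration and are not the contents of `Scores.csv`. It splits the first line into column labels and every later line into one row.

```python
# Hypothetical CSV text for illustration; not the contents of Scores.csv.
sample = "Name,ID,Major\nAlice,1001,Physics\nBob,1002,Math\n"

lines = sample.splitlines()

# The first line holds the column labels.
header = lines[0].split(",")

# Every later line is one data point (here, one student).
rows = [line.split(",") for line in lines[1:]]

print(header)
for row in rows:
    print(row)
```

Note that every value comes back as a string, including the ID.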

### Reading CSV Files

There are lots of libraries for reading CSV files into memory. Before we start using them, let's write our own. We'll need two things:

* Means of reading and interpreting the file.
* A representation of the read data in memory.

We have written a simple CSV reader before. Let's recall one of the ways to read a file in Python.

In [None]:
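f = open("Scores.csv", "r")
first_line = f.readline()  # header line; handled separately below

# readline() returns an empty string at the end of the file, which ends the loop.
line = f.readline()
while line:
    print(line)
    line = f.readline()

f.close()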

Let's work on the first line, which is special:

In [None]:
f=open("Scores.csv","r")
first_line = f.readline()
print(first_line.split(","))
f.close()

It appears that each line ends with `\n`. Here's how we can remove these newlines.

In [None]:
f=open("Scores.csv","r")
first_line = f.readline().rstrip()
print(first_line.split(","))
f.close()

Finally, let's store the first line, which is a list of the column names:

In [None]:
f=open("Scores.csv","r")
first_line = f.readline().rstrip()
fields=first_line.split(",")
f.close()

Next, we read the remaining lines into a list, with one entry per row:

In [None]:
f=open("Scores.csv","r")
first_line = f.readline().rstrip()
fields=first_line.split(",")

data=list()

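# Loop over the remaining lines. Testing the stripped line in a `while`
# condition would stop early at the first blank line in the file.
for line in f:
    line = line.rstrip()
    if line:
        data.append(line.split(","))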

f.close()

In [None]:
data

We have the basics down, but there is more to consider:
* So far we have written example code. We should now write something general that we can reuse in different situations.
* The fields can be of different types: strings or numbers (integer or floating point). We should store each field as the correct data type.
* We still need to decide how to store the data in memory.
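
As one possible way to handle the data type question, here is a minimal sketch of a conversion helper. The name `convert_value` and the sample row are assumptions made for illustration. It tries `int` first, then `float`, and falls back to the original string.

```python
def convert_value(text):
    """Return text as an int if possible, else a float, else the original string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass  # not this type; try the next one
    return text


# Hypothetical row, as produced by line.split(",").
row = ["Alice", "1001", "87.5"]
converted = [convert_value(item) for item in row]
print(converted)
```

This is only a heuristic. For instance, `float` also accepts strings such as `"nan"` and `"inf"`, so a text field containing those words would be turned into a number.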


We have some options on how to proceed:
   * We could write a CSV reader function that, given the filename of a CSV file, reads the data and returns it as a standard Python data object. There are several suitable representations, so we would either have to pick one or provide options to select among them.
   * Instead of a CSV reader function, we could create a CSV reader class. It would be instantiated with a CSV filename, so each instance would be tied to a specific file. It would read the data into some representation that is kept private, and we would provide accessor methods to retrieve specific parts of the data, or the whole data as standard Python data.
   * We could separate the concepts of a CSV reader and how we store the data. That way, we could write other readers (e.g. an Excel file reader) that would still use the same data storage.
   * We might also want to be able to write out CSV files.
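
For the last option, writing a CSV file could look roughly like the sketch below. The function name `write_csv`, the sample data, and the temporary file path are all assumptions made for illustration. The sketch does no quoting, so it assumes that no value contains a comma or a newline.

```python
import os
import tempfile


def write_csv(filename, fields, data):
    """Write the column labels, then one comma-separated line per row."""
    with open(filename, "w") as f:
        f.write(",".join(fields) + "\n")
        for row in data:
            # str() lets rows hold numbers as well as strings.
            f.write(",".join(str(item) for item in row) + "\n")


# Hypothetical column labels and rows.
fields = ["Name", "Score"]
data = [["Alice", 90], ["Bob", 85]]

# Write to a temporary directory so no existing file is overwritten.
path = os.path.join(tempfile.mkdtemp(), "example.csv")
write_csv(path, fields, data)

with open(path) as f:
    written = f.read()
print(written)
```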

In [None]:
class DataFileHandler:
    def __init__(self,extensions):
        self.__extensions=extensions
        
    def check_extension(self,filename):
        file_extension=filename.split(".")[-1]
        return file_extension in self.__extensions

    def _readfile(self,filename):
        raise NotImplementedError    
        
    def readfile(self,filename,check_extension=True):
        if not check_extension or self.check_extension(filename):
            return self._readfile(filename)        
        else:
            print("Error: filename {} does not match acceptable extensions.".format(filename))
    
    def _writefile(self,filename,data):
        raise NotImplementedError
        
    def writefile(self,filename,data):
        return self._writefile(filename,data)
        
        
class CSVHandler(DataFileHandler):
    def __init__(self):
        # Register the file extensions this handler accepts.
        super().__init__(["csv", "CSV"])
        
    def _readfile(self,filename):
        f=open(filename,"r")
        first_line = f.readline().rstrip()
        fields=first_line.split(",")

        data=list()

        # Loop over the remaining lines, skipping blank ones instead of
        # stopping at the first blank line.
        for line in f:
            line = line.rstrip()
            if line:
                data.append(line.split(","))

        f.close()
        
        return fields,data
        
    

In [None]:
my_handler=CSVHandler()
my_handler.readfile("Scores.csv")
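
The handler returns the column labels and the rows as two separate lists, which leaves the question of in-memory representation open. Two common choices are sketched below, assuming a `(fields, data)` pair like the one returned above. The helper names `to_records` and `to_columns` and the sample values are hypothetical.

```python
def to_records(fields, data):
    """Pair each row with the column labels, giving one dict per row."""
    return [dict(zip(fields, row)) for row in data]


def to_columns(fields, data):
    """Group values by column, giving one list per column label."""
    return {name: [row[i] for row in data] for i, name in enumerate(fields)}


# Hypothetical values standing in for the (fields, data) pair from readfile.
fields = ["Name", "Score"]
data = [["Alice", "90"], ["Bob", "85"]]

records = to_records(fields, data)
columns = to_columns(fields, data)

print(records[0]["Name"])
print(columns["Score"])
```

The list of dicts is convenient when working with one data point at a time, while the dict of lists is convenient when working with a whole column.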

## Pandas

In [None]:
import pandas as pd
Data=pd.read_csv("Scores.csv")

In [None]:
Data