# Import data

Flat files: .txt, .csv
Other file formats: Excel (.xls), Stata, SAS, MATLAB
Relational databases: SQLite, MySQL

## Plain text files
Two types: plain text and tabular data (e.g. CSV)

In [None]:
filename = 'huck_finn.txt'
file = open(filename, mode='r') # 'r' is to read
text = file.read()
file.close()
print(text)

# writing to a file
filename = 'huck_finn.txt'
file = open(filename, mode='w') # 'w' is to write; this empties the file if it already exists
file.close()

In [None]:
# Context manager: with
with open('huck_finn.txt', 'r') as file:
    print(file.read())
# best practice: the file is closed automatically when the with block exits
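
The write-then-read round trip can be sketched with context managers as follows. This is a minimal illustration, not part of the original notes: the file name `example.txt` and its contents are made up, and the file lives in a temporary directory so that nothing real (such as `huck_finn.txt`) is overwritten.

```python
import os
import tempfile

# Hypothetical scratch file inside a temporary directory, so no real file is overwritten
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.txt')

    # mode='w' creates the file, or empties it first if it already exists
    with open(path, mode='w') as file:
        file.write('first line\nsecond line\n')

    # mode='r' opens the file for reading only
    with open(path, mode='r') as file:
        text = file.read()

print(text)
print(file.closed)  # True: leaving the with block closed the file
```

Neither `with` block needs an explicit `file.close()`, even if an exception is raised inside it.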

In [None]:
# Example
# Open a file: file
file = open('moby_dick.txt', 'r')

# Print it
print(file.read())

# Check whether file is closed
print(file.closed)

# Close file
file.close()

# Check whether file is closed
print(file.closed)

## Flat Files - .csv, .txt
What they are: text files containing records, i.e. table data
record = a row of fields or attributes
column = a feature
header = the column names
Import flat files with NumPy (for numerical arrays) or pandas (for DataFrames)

## Import flat files with NumPy
NumPy arrays are the standard for storing numerical data
essential for other packages, e.g. scikit-learn
import functions: np.loadtxt() and np.genfromtxt()

np.loadtxt() takes several arguments that you'll find useful. The delimiter argument sets the delimiter that loadtxt() expects, for example ',' for comma-delimited and '\t' for tab-delimited files. The skiprows argument specifies how many rows (a count, not indices) to skip from the top of the file. The usecols argument takes a list of the indices of the columns you wish to keep.
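
As a quick check of these arguments without needing a file on disk, here is a small sketch that reads made-up tab-delimited data from an in-memory `io.StringIO` buffer (np.loadtxt() accepts file-like objects as well as filenames). The column names and values are invented for illustration.

```python
import io
import numpy as np

# Made-up tab-delimited data with one header row, held in memory instead of a file
raw = io.StringIO("a\tb\tc\n1\t2\t3\n4\t5\t6\n7\t8\t9\n")

# skiprows=1 skips the header line; usecols=[0, 2] keeps the first and third columns
subset = np.loadtxt(raw, delimiter='\t', skiprows=1, usecols=[0, 2])

print(subset.shape)  # (3, 2)
print(subset)
```

The result is a float array, because float is the default dtype of np.loadtxt().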

In [None]:
import numpy as np
filename = 'MNIST.txt'
data = np.loadtxt(filename, delimiter=',')
data

# additional arguments
# skip header row
data = np.loadtxt(filename, delimiter=',', skiprows=1)
# keep only the 1st and 3rd columns
data = np.loadtxt(filename, delimiter=',', skiprows=1, usecols=[0,2])
# import as str type
data = np.loadtxt(filename, delimiter=',', skiprows=1, dtype=str)
# mixed datatypes - don't use np.loadtxt(); use np.genfromtxt() or a pandas DataFrame


You can find more information about the MNIST dataset on the webpage of Yann LeCun, who at the time of writing was Director of AI Research at Facebook and Founding Director of the NYU Center for Data Science, among many other things.
http://yann.lecun.com/exdb/mnist/

In this exercise, you're now going to load the MNIST digit recognition dataset using the numpy function loadtxt() and see just how easy it can be:

In [None]:
# example
# Import packages
import numpy as np
import matplotlib.pyplot as plt

# Assign filename to variable: file
file = 'digits.csv'

# Load file as array: digits
digits = np.loadtxt(file, delimiter=',')

# Print datatype of digits
print(type(digits))

# Select row 21, skip column 0 (assumed to hold the label), and reshape the 784 pixels to 28x28
im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))

# Plot reshaped data
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
plt.show()

In [None]:
# example: customize NumPy import
# Import numpy
import numpy as np

# Assign the filename: file
file = 'digits_header.txt'

# Load the data: data
data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0,2])

# Print data
print(data)

### Importing different datatypes
The file seaslug.txt has a text header consisting of strings and is tab-delimited.

Because of the header, importing it as-is with np.loadtxt() raises a ValueError saying that a string could not be converted to float. There are two ways to deal with this. First, you can set the data type argument dtype equal to str (for string), so that every value, header included, is read as a string.

Alternatively, you can skip the first row with the skiprows argument, as seen before.
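
The failure and both workarounds can be reproduced on a small in-memory stand-in for such a file. The header names and values below are made up for illustration and are not taken from seaslug.txt.

```python
import io
import numpy as np

# Made-up stand-in for a tab-delimited file with a text header
contents = "Time\tPercent\n99\t0.067\n99\t0.133\n"

# Importing as-is fails: the header strings cannot be converted to float
try:
    np.loadtxt(io.StringIO(contents), delimiter='\t')
    header_failed = False
except ValueError as err:
    header_failed = True
    print('ValueError:', err)

# Fix 1: read everything, header included, as strings
as_str = np.loadtxt(io.StringIO(contents), delimiter='\t', dtype=str)

# Fix 2: skip the header row and keep the default dtype (float)
as_float = np.loadtxt(io.StringIO(contents), delimiter='\t', skiprows=1)

print(as_str[0])    # the header row, as strings
print(as_float[0])  # the first data row, as floats
```

The second fix is usually the more useful one, since the values stay numeric and can be plotted or computed with directly.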

In [None]:
# Import packages
import numpy as np
import matplotlib.pyplot as plt

# Assign filename: file
file = 'seaslug.txt'

# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)

# Print the first element of data
print(data[0])

# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)

# Print the 10th row of data_float
print(data_float[9])

# Plot a scatterplot of the data
plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('time (min.)')
plt.ylabel('percentage of larvae')
plt.show()

### Working with mixed datatypes

Often you will need to import datasets that have different datatypes in different columns; one column may contain strings and another floats, for example. np.loadtxt() cannot handle this with its default settings, because it tries to convert every field to a single dtype.

Another function, np.genfromtxt(), can handle such structures.
If we pass dtype=None to it, it infers the type of each column.

Import 'titanic.csv' using the function np.genfromtxt() as follows:

data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)

Here, the first argument is the filename, the second specifies the delimiter, and the third argument, names, tells genfromtxt() that the file has a header. Because the data are of different types, data is a structured array. NumPy arrays must contain elements that all have the same type, and the structured array satisfies this by being a 1D array in which each element is one row of the imported flat file. You can check this by inspecting the array's shape with np.shape(data).

Accessing rows and columns of structured arrays is straightforward: to get the ith row, execute data[i], and to get the column with name 'Fare', execute data['Fare'].
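
A minimal sketch of this behaviour, using a few made-up rows in the spirit of titanic.csv rather than the real file. Passing encoding='utf-8' explicitly is an addition here, made so that string columns come back as str regardless of the NumPy version's default.

```python
import io
import numpy as np

# Made-up rows: an integer column, a string column and a float column
raw = io.StringIO(
    "PassengerId,Sex,Fare\n"
    "1,male,7.25\n"
    "2,female,71.2833\n"
    "3,female,7.925\n"
)

# names=True reads the column names from the header; dtype=None infers a type per column
data = np.genfromtxt(raw, delimiter=',', names=True, dtype=None, encoding='utf-8')

print(np.shape(data))    # (3,): a 1D array with one element per row
print(data.dtype.names)  # ('PassengerId', 'Sex', 'Fare')
print(data[1])           # the second row
print(data['Fare'])      # the whole Fare column
```

Each element of data is a record whose fields have their own types, which is how one array can hold integers, strings and floats side by side.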

In [None]:
# np.genfromtxt() with names for header, dtype=None will auto determine dtypes
data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)

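There is also a function np.recfromcsv() that behaves like np.genfromtxt(), except that its defaults are delimiter=',', names=True and dtype=None. Note that np.recfromcsv() is no longer available in the main namespace as of NumPy 2.0; on recent versions, call np.genfromtxt() with those arguments instead. In this exercise, you'll practice using it to achieve the same result.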

In [None]:
# like np.genfromtxt(), but delimiter=',', names=True and dtype=None are the defaults
data = np.recfromcsv('titanic.csv')