Reading CSV and TXT files 
    - pandas 
        - most typical use: loading data from files or other data sources 
            - for further exploration, transformation and analysis
    - section goals:
        - learn to read comma-separated values (.csv) and raw text (.txt) files into pandas DataFrames

In [2]:
import pandas as pd 

Reading Data with Python - Files
    - First step is to open the file -> open()
        - open(): single required argument -> path to file
              - returns a single value, the file object

In [7]:
#  with statement: automatically takes care of closing the file once it leaves the with block, even in cases of error

#  opening the file 'btc-market-price.csv' for reading using a with statement and the open() function

filepath = 'btc-market-price.csv'

with open(filepath, 'r') as reader:
    print(reader)

<_io.TextIOWrapper name='btc-market-price.csv' mode='r' encoding='UTF-8'>


In [9]:
# Once the file is opened,  read its contents
# following code reads and prints the first 10 lines of the file 'btc-market-price.csv'.

filepath = 'btc-market-price.csv' # specifies the file path to 'btc-market-price.csv'.

# with statement opens the file 'btc-market-price.csv' in read mode ('r') and creates a file reader object named reader
with open(filepath, 'r') as reader:
    # loop iterates through the lines of the file using enumerate()
    # reads all the lines in the file using readlines()
    # provides both the line index (index) and the line content (line) in each iteration
    for index, line in enumerate(reader.readlines()):
        # read just the first 10 lines
        if (index < 10):  # ensures that only the first 10 lines are printed
            # prints the index (line number) and the content of each line to the console for the first 10 lines of the file
            print(index, line)  

0 2/4/17 0:00,1099.169125

1 3/4/17 0:00,1141.813

2 4/4/17 0:00,?

3 5/4/17 0:00,1133.079314

4 6/4/17 0:00,-

5 7/4/17 0:00,-

6 8/4/17 0:00,1181.149838

7 9/4/17 0:00,1208.8005

8 10/4/17 0:00,1207.744875

9 11/4/17 0:00,1226.617038
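The loop above calls readlines(), which loads the whole file into memory before the first 10 lines are printed. A minimal sketch of a lazier variant follows, assuming a small temporary file with made-up contents in place of 'btc-market-price.csv'. Stripping the trailing newline also avoids the blank line between printed rows seen in the output above.

```python
import itertools
import os
import tempfile

# Hypothetical sample data standing in for 'btc-market-price.csv'.
sample = (
    "2/4/17 0:00,1099.169125\n"
    "3/4/17 0:00,1141.813\n"
    "4/4/17 0:00,?\n"
    "5/4/17 0:00,1133.079314\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    filepath = os.path.join(tmp_dir, "sample.csv")
    with open(filepath, "w") as writer:
        writer.write(sample)

    with open(filepath, "r") as reader:
        # A file object yields lines lazily; islice stops after 3 of them,
        # so the remaining lines are never read.
        first_lines = [line.rstrip("\n") for line in itertools.islice(reader, 3)]

for index, line in enumerate(first_lines):
    print(index, line)
```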



Reading Data with Pandas 
    - Most common data sources in data analysis work 
      - public data sources
      - logs
      - historical information tables
      - exports from databases 
    - Pandas library offers functions to read and write files in multiple formats
        - CSV
        - JSON
        - XML
        - Excel (XLSX)
      - each reader creates a DataFrame from the data in the file
    - Learn how to read different types of data including:
          - CSV files (.csv)
          - Raw text files (.txt)
          - JSON data from a file and from an API
          - Data from a SQL query over a database

The read_csv Method 
    - Reads comma-separated values (CSV) files and raw text (TXT) files into a DataFrame
  - read_csv function
      - accepts a very broad set of parameters at import time
          - these configure exactly how data is read and parsed: structure, encoding, etc.
      - Common parameters: 
        - filepath_or_buffer: Path of the file to be read (also accepts a URL or a file-like object).
        - sep: Character(s) that are used as a field separator in the file.
        - header: Index of the row containing the names of the columns (None if none).
        - index_col: Index of the column or sequence of indexes that should be used as index of rows of the data.
        - names: Sequence containing the names of the columns (used together with header = None).
        - skiprows: Number of rows or sequence of row indexes to ignore in the load.
        - na_values: Sequence of values that, if found in the file, should be treated as NaN.
        - dtype: Dictionary in which the keys will be column names and the values will be types of NumPy to which their content must be converted.
        - parse_dates: Controls date parsing. True tries to parse the index as dates; a list of column names or positions parses those columns as dates.
        - date_parser: Function to use to try to parse dates (deprecated since pandas 2.0 in favor of date_format).
        - nrows: Number of rows to read from the beginning of the file.
        - skipfooter: Number of rows to ignore at the end of the file (not supported by the C engine, so it needs engine='python').
        - encoding: Encoding to be expected from the file read.
        - squeeze: Flag that indicates that if the data read only contains one column the result is a Series instead of a DataFrame (deprecated in pandas 1.4 and removed in 2.0; call .squeeze("columns") on the result instead).
        - thousands: Character to use to detect the thousands separator.
        - decimal: Character to use to detect the decimal separator.
        - skip_blank_lines: Flag that indicates whether blank lines should be ignored.
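
Several of the parameters listed above can be combined in one call. The sketch below is only an illustration: the semicolon-separated text, the leading comment line and the rounded prices are made up for this example, and io.StringIO stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical raw text: one junk line, then ';'-separated rows
# that use a comma as decimal mark and '?' for a missing price.
raw = (
    "# exported 2017\n"
    "2/4/17 0:00;1099,17\n"
    "3/4/17 0:00;1141,81\n"
    "4/4/17 0:00;?\n"
    "5/4/17 0:00;1133,08\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep=';',                       # field separator
    header=None,                   # the data has no header row
    names=['Timestamp', 'Price'],  # column names to assign
    skiprows=1,                    # ignore the first line
    na_values=['?'],               # treat '?' as NaN
    decimal=',',                   # comma is the decimal mark
    nrows=3,                       # read only the first 3 data rows
)

print(df)
print(df.dtypes)
```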

In [None]:
# First CSV file to read 
# When using the read_csv method, pass the path of the CSV file as the first argument
# Any valid string path is acceptable
#       - the string could be a URL
#           - Valid URL schemes include HTTP, FTP, S3, and file
#           - For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv


# read_csv method to load data directly from a URL:
csv_url = "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv"

pd.read_csv(csv_url).head()

In [15]:
# use a local file

df = pd.read_csv('btc-market-price.csv')

df.head()

# pandas tries to infer everything about the data (header, column types, ...)
# but in most cases pandas needs to be told explicitly how to load the data

Unnamed: 0,2/4/17 0:00,1099.169125
0,3/4/17 0:00,1141.813
1,4/4/17 0:00,?
2,5/4/17 0:00,1133.079314
3,6/4/17 0:00,-
4,7/4/17 0:00,-


First row behavior with header parameter 

In [16]:
# The CSV file being read has only two columns: Timestamp and Price
# It doesn't have a header row
# Pandas automatically used the first row of data as the header, which is incorrect
# Override this behavior with the header parameter

df = pd.read_csv('btc-market-price.csv',
                 header=None)

In [17]:
df.head()

Unnamed: 0,0,1
0,2/4/17 0:00,1099.169125
1,3/4/17 0:00,1141.813
2,4/4/17 0:00,?
3,5/4/17 0:00,1133.079314
4,6/4/17 0:00,-


Missing values with na_values parameter

In [19]:
# define the na_values parameter with the values to be recognized as NA/NaN
# in this case empty strings (''), question marks ('?') and hyphens ('-') are treated as missing values

df = pd.read_csv('btc-market-price.csv',
                 header=None,   # with header=None, pandas assigns default integer column names
                 na_values=['', '?', '-'])   # values to treat as missing (NaN) when reading the file

In [20]:
df.head()

Unnamed: 0,0,1
0,2/4/17 0:00,1099.169125
1,3/4/17 0:00,1141.813
2,4/4/17 0:00,
3,5/4/17 0:00,1133.079314
4,6/4/17 0:00,
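
A quick way to check what na_values changed is to count the missing values and compare dtypes with and without it. This sketch uses the first rows shown earlier as inline text instead of the real file. The per-column dict form of na_values and the column names are additions for illustration.

```python
import io

import pandas as pd

# First rows of the sample data, inlined so the example is self-contained.
raw = (
    "2/4/17 0:00,1099.169125\n"
    "3/4/17 0:00,1141.813\n"
    "4/4/17 0:00,?\n"
    "5/4/17 0:00,1133.079314\n"
    "6/4/17 0:00,-\n"
    "7/4/17 0:00,-\n"
)

# Without na_values, '?' and '-' are kept as text, so column 1 is not numeric.
plain = pd.read_csv(io.StringIO(raw), header=None)

# With na_values, the placeholders become NaN and the column is parsed as float.
cleaned = pd.read_csv(io.StringIO(raw), header=None, na_values=['?', '-'])

# A dict restricts the placeholders to one column only.
per_column = pd.read_csv(io.StringIO(raw), header=None,
                         names=['Timestamp', 'Price'],
                         na_values={'Price': ['?', '-']})

print(plain[1].isna().sum(), cleaned[1].isna().sum())
print(cleaned[1].dtype)
```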


Column names using names parameter

In [21]:
# add column names using the names parameter

df = pd.read_csv('btc-market-price.csv',
                 header=None,
                 na_values=['', '?', '-'],
                 names=['Timestamp', 'Price'])

Column types using dtype parameter 
    - use dtype to force pandas to use a certain type 
        - w/o it, pandas will try to figure out each column type automatically

In [22]:
# force the Price column to be float.

df = pd.read_csv('btc-market-price.csv',
                 header=None,
                 na_values=['', '?', '-'],
                 names=['Timestamp', 'Price'],
                 dtype={'Price': 'float'})

In [23]:
df.head()

Unnamed: 0,Timestamp,Price
0,2/4/17 0:00,1099.169125
1,3/4/17 0:00,1141.813
2,4/4/17 0:00,
3,5/4/17 0:00,1133.079314
4,6/4/17 0:00,


In [25]:
df.dtypes

Timestamp     object
Price        float64
dtype: object

In [26]:
# From df.dtypes can see that: 
# - Timestamp column was interpreted as a regular string (object in pandas notation)
#         -> can parse it manually using a vectorized operation

#  parse Timestamp column to Datetime objects using to_datetime method:

pd.to_datetime(df['Timestamp']).head()


0   2017-02-04
1   2017-03-04
2   2017-04-04
3   2017-05-04
4   2017-06-04
Name: Timestamp, dtype: datetime64[ns]

In [27]:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])


In [28]:
df.head()

Unnamed: 0,Timestamp,Price
0,2017-02-04,1099.169125
1,2017-03-04,1141.813
2,2017-04-04,
3,2017-05-04,1133.079314
4,2017-06-04,


In [29]:
df.dtypes

Timestamp    datetime64[ns]
Price               float64
dtype: object
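
One caveat on the parsed dates: the raw values step through 2/4/17, 3/4/17, ... 11/4/17, which looks like consecutive days written day/month, while the output above reads them month-first (2017-02-04, 2017-03-04, ...). Whether the file really is day-first is an assumption here, not something the source states. If it is, an explicit format removes the ambiguity. A minimal sketch on a few inlined values:

```python
import pandas as pd

# A few raw values as they appear in the file.
timestamps = pd.Series(
    ["2/4/17 0:00", "3/4/17 0:00", "10/4/17 0:00", "11/4/17 0:00"],
    name="Timestamp",
)

# Assumption: the values are day/month/2-digit-year followed by hour:minute.
parsed = pd.to_datetime(timestamps, format="%d/%m/%y %H:%M")

print(parsed)
```

With this format every value lands in April 2017 instead of being spread over February to November.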

Data parser using parse_dates parameter 
    - Another way to get Datetime objects 
        - Pass parse_dates a list with the positions (or names) of the columns that contain dates

In [30]:
df = pd.read_csv('btc-market-price.csv',
                 header=None,
                 na_values=['', '?', '-'],
                 names=['Timestamp', 'Price'],
                 dtype={'Price': 'float'},
                 parse_dates=[0])


In [31]:
df.head()

Unnamed: 0,Timestamp,Price
0,2017-02-04,1099.169125
1,2017-03-04,1141.813
2,2017-04-04,
3,2017-05-04,1133.079314
4,2017-06-04,


In [32]:
df.dtypes

Timestamp    datetime64[ns]
Price               float64
dtype: object

Adding index to data using index_col parameter
    - By default, pandas assigns an auto-incrementing numeric index (row labels) starting at zero
    - Override the default by setting the index_col parameter to a column
        - accepts a column position (integer) or a column name (string) to use a single column as index
        - accepts a list of positions or names to create a multi-index

In [33]:
# choose the first column, Timestamp, as index by passing its position (0) in the index_col argument

df = pd.read_csv('btc-market-price.csv',
                 header=None,
                 na_values=['', '?', '-'],
                 names=['Timestamp', 'Price'],
                 dtype={'Price': 'float'},
                 parse_dates=[0],
                 index_col=[0])


In [34]:
df.head()

Timestamp,Price
2017-02-04,1099.169125
2017-03-04,1141.813
2017-04-04,
2017-05-04,1133.079314
2017-06-04,


In [35]:
df.dtypes

Price    float64
dtype: object
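
A date index is useful because rows can then be selected by date strings. The sketch below assumes ISO-formatted dates and made-up prices in inline text (not the real file), and refers to the column by name rather than by position.

```python
import io

import pandas as pd

# Hypothetical data with a header row and unambiguous ISO dates.
raw = (
    "Timestamp,Price\n"
    "2017-04-02,1099.17\n"
    "2017-04-03,1141.81\n"
    "2017-05-01,1350.21\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    parse_dates=['Timestamp'],  # parse this column as dates
    index_col='Timestamp',      # and use it as the row index
)

# With a DatetimeIndex, a partial date string selects a whole month.
april = df.loc['2017-04']

print(type(df.index).__name__)
print(april)
```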

More Challenging Parsing
    - Reading another CSV file with following columns: 
      - first_name
      - last_name
      - age
      - math_score
      - french_score
      - next_test_date

In [36]:
exam_df = pd.read_csv('exam_review.csv')

In [37]:
exam_df

Unnamed: 0,Unnamed: 1,first_name>last_name>age>math_score>french_score
"Ray>Morley>18>""68","000"">""75","000"""
Melvin>Scott>24>77>83,,
Amirah>Haley>22>92>67,,
"Gerard>Mills>19>""78","000"">72",
Amy>Grimes>23>91>81,,


Custom data delimiters using sep parameter
    - Define which delimiter to use with the sep parameter
          - If omitted, pandas assumes a comma (,); it only tries to detect the separator when sep=None is passed (handled by the Python engine)
    - Most CSV files use a comma (,) as separator, which matches the default
    - Other separators also appear: semicolon (;), tab (\t, especially in TSV files), whitespace or any other special character

In [38]:
# Case: separator is a > character.

exam_df = pd.read_csv('exam_review.csv',
                      sep='>')

In [39]:
exam_df

Unnamed: 0,first_name,last_name,age,math_score,french_score
0,Ray,Morley,18,68000,75000
1,Melvin,Scott,24,77,83
2,Amirah,Haley,22,92,67
3,Gerard,Mills,19,78000,72
4,Amy,Grimes,23,91,81
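
The same idea applies to the other separators mentioned in this section. A short sketch with made-up inline data, reading the same small table written with semicolons, tabs and runs of spaces:

```python
import io

import pandas as pd

# The same hypothetical table written with three different separators.
semicolon_text = "name;age\nRay;18\nAmy;23\n"
tab_text = "name\tage\nRay\t18\nAmy\t23\n"
space_text = "name   age\nRay    18\nAmy    23\n"

by_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=';')
by_tab = pd.read_csv(io.StringIO(tab_text), sep='\t')
# A regular expression such as \s+ matches any run of whitespace.
by_space = pd.read_csv(io.StringIO(space_text), sep=r'\s+')

print(by_semicolon.equals(by_tab), by_tab.equals(by_space))
```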


Custom Data Encoding 
    - Files are stored using different "encodings"
            - Such as:  ASCII, UTF-8, latin1, etc
    - While reading data custom encoding can be defined with the encoding parameter: 
            - encoding='UTF-8': will be used if data is UTF-8 encoded.
            - encoding='iso-8859-1': will be used if data is ISO/IEC 8859-1 ("extended ASCII") encoded.
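
A minimal sketch of the encoding parameter, assuming made-up names encoded as ISO-8859-1 bytes in memory (io.BytesIO stands in for a file saved with that encoding):

```python
import io

import pandas as pd

# Hypothetical content with accented characters, stored as ISO-8859-1 bytes.
raw_bytes = "first_name,city\nJosé,Málaga\nZoë,Köln\n".encode("latin-1")

# The encoding passed here must match how the bytes were written.
df = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")

print(df)
```

Reading the same bytes as UTF-8 would not give these strings, because the accented characters are stored as single bytes that are not valid UTF-8 sequences.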



Custom numeric decimal and thousands character 
    - The decimal and thousands characters could change between datasets

In [40]:
# If a column uses a comma (,) as its decimal or thousands separator,
# pandas reads that column as strings instead of numbers

exam_df = pd.read_csv('exam_review.csv',
                      sep='>')

In [41]:
exam_df

Unnamed: 0,first_name,last_name,age,math_score,french_score
0,Ray,Morley,18,68000,75000
1,Melvin,Scott,24,77,83
2,Amirah,Haley,22,92,67
3,Gerard,Mills,19,78000,72
4,Amy,Grimes,23,91,81


In [42]:
exam_df[['math_score', 'french_score']].dtypes

math_score      object
french_score    object
dtype: object

In [43]:
# use the decimal and/or thousands parameters to tell pandas which characters mark the decimal point and the thousands groups
# with decimal=',', the value 68,000 is read as 68.0

exam_df = pd.read_csv('exam_review.csv',
                      sep='>',
                      decimal=',')

In [44]:
exam_df

Unnamed: 0,first_name,last_name,age,math_score,french_score
0,Ray,Morley,18,68.0,75.0
1,Melvin,Scott,24,77.0,83.0
2,Amirah,Haley,22,92.0,67.0
3,Gerard,Mills,19,78.0,72.0
4,Amy,Grimes,23,91.0,81.0


In [45]:
exam_df[['math_score', 'french_score']].dtypes

math_score      float64
french_score    float64
dtype: object

In [47]:
# the thousands parameter:

pd.read_csv('exam_review.csv',
            sep='>',
            thousands=',')

Unnamed: 0,first_name,last_name,age,math_score,french_score
0,Ray,Morley,18,68000,75000
1,Melvin,Scott,24,77,83
2,Amirah,Haley,22,92,67
3,Gerard,Mills,19,78000,72
4,Amy,Grimes,23,91,81
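
Both parameters can be used together when a file follows the convention of '.' for thousands and ',' for decimals. The data below is invented for illustration and is not part of exam_review.csv.

```python
import io

import pandas as pd

# Hypothetical amounts written as 1.234,56 (dot = thousands, comma = decimal).
raw = "item;amount\nrent;1.234,56\nfood;789,10\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep=';',
    decimal=',',    # comma marks the decimal point
    thousands='.',  # dot groups thousands
)

print(df)
print(df['amount'].dtype)
```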


Excluding Specific Rows
    - use skiprows to:
        - Exclude a specified number of rows from the beginning of a file, by passing an integer argument. The header line is skipped too, so the first remaining line becomes the header.
        - Skip reading specific row indices from a file, by passing a list containing row indices to skip

In [49]:
exam_df = pd.read_csv('exam_review.csv',
                      sep='>',  #  specifies the delimiter '>' as the separator between values in the CSV file
                      decimal=',')  # specifies the character ',', indicating that a comma is used as the decimal point

In [50]:
exam_df

Unnamed: 0,first_name,last_name,age,math_score,french_score
0,Ray,Morley,18,68.0,75.0
1,Melvin,Scott,24,77.0,83.0
2,Amirah,Haley,22,92.0,67.0
3,Gerard,Mills,19,78.0,72.0
4,Amy,Grimes,23,91.0,81.0


In [51]:
# skip reading the first 2 rows from this file
# use skiprows=2:

pd.read_csv('exam_review.csv',
            sep='>',
            skiprows=2)

Unnamed: 0,Melvin,Scott,24,77,83
0,Amirah,Haley,22,92,67
1,Gerard,Mills,19,78000,72
2,Amy,Grimes,23,91,81


In [52]:
# row indices count from 0 and include the header line (row 0)
# so skiprows=[1,3] skips the first and third data rows (Ray and Amirah):

exam_df = pd.read_csv('exam_review.csv',
                      sep='>',
                      decimal=',',
                      skiprows=[1,3])

In [53]:
exam_df

Unnamed: 0,first_name,last_name,age,math_score,french_score
0,Melvin,Scott,24,77.0,83
1,Gerard,Mills,19,78.0,72
2,Amy,Grimes,23,91.0,81
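
Besides an integer or a list, skiprows also accepts a function that receives each row index and returns True for rows to skip. A small sketch on inline data modeled on the exam file (only two columns are kept for brevity):

```python
import io

import pandas as pd

# Reduced, inlined version of the exam data.
raw = (
    "first_name>age\n"
    "Ray>18\n"
    "Melvin>24\n"
    "Amirah>22\n"
    "Gerard>19\n"
    "Amy>23\n"
)

# Row 0 is the header; skip every even-numbered row after it (rows 2 and 4).
df = pd.read_csv(
    io.StringIO(raw),
    sep='>',
    skiprows=lambda row_index: row_index > 0 and row_index % 2 == 0,
)

print(df)
```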


Get rid of blank lines
    - skip_blank_lines parameter 
        - defaults to True, so blank lines are skipped when reading 
        - if False, every blank line is loaded into the DataFrame as a row of NaN values

In [54]:
pd.read_csv('exam_review.csv',
            sep='>',
            skip_blank_lines=False)

# will print the NaNs

Unnamed: 0,first_name,last_name,age,math_score,french_score
0,Ray,Morley,18.0,68000.0,75000.0
1,Melvin,Scott,24.0,77.0,83.0
2,Amirah,Haley,22.0,92.0,67.0
3,,,,,
4,Gerard,Mills,19.0,78000.0,72.0
5,Amy,Grimes,23.0,91.0,81.0


Loading Specific Columns
    - Use the usecols parameter to load only specific columns instead of all of them
    - Advantage: the needed columns are selected while the dataset is loading
            -  no need to load the entire DataFrame into memory and then delete the unwanted columns

In [55]:
# usecols accepts either a list of strings (column names)
#      - or a list of integers (column positions)

pd.read_csv('exam_review.csv',
            usecols=['first_name', 'last_name', 'age'],
            sep='>')

Unnamed: 0,first_name,last_name,age
0,Ray,Morley,18
1,Melvin,Scott,24
2,Amirah,Haley,22
3,Gerard,Mills,19
4,Amy,Grimes,23


In [56]:
# or use just the column position

pd.read_csv('exam_review.csv',
            usecols=[0, 1, 2],
            sep='>')

Unnamed: 0,first_name,last_name,age
0,Ray,Morley,18
1,Melvin,Scott,24
2,Amirah,Haley,22
3,Gerard,Mills,19
4,Amy,Grimes,23
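
usecols can also be a function that is called with each column name and returns True for the columns to keep, which helps when the names follow a pattern. A sketch on two inlined rows of the exam data:

```python
import io

import pandas as pd

# Two rows of the exam data, inlined for a self-contained example.
raw = (
    "first_name>last_name>age>math_score>french_score\n"
    "Melvin>Scott>24>77>83\n"
    "Amy>Grimes>23>91>81\n"
)

# Keep only the columns whose name ends in '_score'.
scores = pd.read_csv(
    io.StringIO(raw),
    sep='>',
    usecols=lambda column_name: column_name.endswith('_score'),
)

print(scores)
```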


Using a Series instead of DataFrame

In [58]:
# if the parsed data contains only one column, it can be returned as a Series
# older pandas offered read_csv(..., squeeze=True) for this; it was deprecated in 1.4 and removed in 2.0
# calling .squeeze("columns") on the resulting DataFrame gives the same result
exam_test_1 = pd.read_csv('exam_review.csv',
                          sep='>',    #  '>' is used as the separator between values in the CSV file.
                          usecols=['last_name'])  # including only the 'last_name' column.

In [59]:
type(exam_test_1)

pandas.core.frame.DataFrame

In [None]:
exam_test_2 = pd.read_csv('exam_review.csv',
                          sep='>',
                          usecols=['last_name']).squeeze("columns")  # one-column DataFrame -> Series

Save to CSV File
    - Save DataFrame as a CSV file

In [61]:
exam_df

Unnamed: 0,first_name,last_name,age,math_score,french_score
0,Melvin,Scott,24,77.0,83
1,Gerard,Mills,19,78.0,72
2,Amy,Grimes,23,91.0,81


In [62]:
# generate a CSV string from the DataFrame:

exam_df.to_csv()

',first_name,last_name,age,math_score,french_score\n0,Melvin,Scott,24,77.0,83\n1,Gerard,Mills,19,78.0,72\n2,Amy,Grimes,23,91.0,81\n'

In [63]:
# Or specify a file path where the generated CSV output should be saved:

exam_df.to_csv('out.csv')

In [64]:
pd.read_csv('out.csv')

Unnamed: 0.1,Unnamed: 0,first_name,last_name,age,math_score,french_score
0,0,Melvin,Scott,24,77.0,83
1,1,Gerard,Mills,19,78.0,72
2,2,Amy,Grimes,23,91.0,81


In [65]:
exam_df.to_csv('out.csv',
               index=False)

# index=False leaves the index out of the exported CSV file
# so reading the file back does not produce an extra 'Unnamed: 0' column


In [66]:
pd.read_csv('out.csv')

Unnamed: 0,first_name,last_name,age,math_score,french_score
0,Melvin,Scott,24,77.0,83
1,Gerard,Mills,19,78.0,72
2,Amy,Grimes,23,91.0,81
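
The write and read steps can be checked as a round trip without touching the disk. The sketch below builds a small DataFrame by hand (the missing math score is made up) and uses the na_rep parameter of to_csv, which is not covered above, to write missing values as 'NA'.

```python
import io

import pandas as pd

# Small hand-built DataFrame; the missing score is invented for illustration.
df = pd.DataFrame({
    'first_name': ['Melvin', 'Gerard', 'Amy'],
    'age': [24, 19, 23],
    'math_score': [77.0, 78.0, None],
})

# Without a path, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False, na_rep='NA')

# read_csv treats 'NA' as a missing value by default.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```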
