In [1]:
import pandas as pd

# Working with files

## Reading data into pandas from a flat file

A common task is to work with data stored in one or more files, for example a Comma Separated Value (CSV) file or another type of delimited file (e.g. tab or pipe).
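As a minimal sketch of how other delimiters can be handled, the `sep` parameter of `pd.read_csv()` sets the delimiter (a comma is the default). The tiny dataset below is hypothetical and held in memory purely for illustration; the variable names `pipe_text`, `tab_text`, `pipe_df` and `tab_df` are assumptions, not part of the original example.

```python
import io

import pandas as pd

# Hypothetical pipe-delimited data, held in memory for illustration only
pipe_text = "id|diagnosis|radius_mean\n1|M|17.99\n2|B|13.54\n"

# sep tells pandas which delimiter separates the fields
pipe_df = pd.read_csv(io.StringIO(pipe_text), sep='|', index_col='id')

# The same data, but tab-delimited
tab_text = pipe_text.replace('|', '\t')
tab_df = pd.read_csv(io.StringIO(tab_text), sep='\t', index_col='id')

# Both calls should produce the same dataframe
print(pipe_df.equals(tab_df))
```

Only the `sep` argument changes between the two calls; everything else about reading the file stays the same.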

There are a number of scenarios you may encounter:

1. The data file is held locally on your machine (or network drive)
2. The data file is accessed via a URL (e.g. it is hosted on GitHub or on a third-party website)
3. The data file is compressed, e.g. in `.zip` format

> The good news is that reading from a local directory and reading from a remote URL work the same way in `pandas`. In both cases we can use the `pd.read_csv()` function, specifying either the local path to the file or the URL.
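To illustrate the local case, here is a small sketch that writes a toy CSV to a temporary directory and reads it back by path. The data, the file name `toy.csv` and the commented-out `example.com` URL are all hypothetical; they simply stand in for a real dataset.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for a real dataset
toy = pd.DataFrame({'id': [1, 2], 'diagnosis': ['M', 'B']})

with tempfile.TemporaryDirectory() as tmp_dir:
    local_path = os.path.join(tmp_dir, 'toy.csv')
    toy.to_csv(local_path, index=False)

    # Local file: pass the path to the file
    local_df = pd.read_csv(local_path, index_col='id')

# Remote file: pass the URL instead; the rest of the call is unchanged.
# (Hypothetical URL, left commented out so the sketch runs offline.)
# remote_df = pd.read_csv('https://example.com/toy.csv', index_col='id')

print(local_df)
```

The only thing that differs between the two scenarios is the string passed as the first argument.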

As an example, let's read in the famous [Wisconsin Breast Cancer dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29) from GitHub.

In [2]:
# Wisconsin breast cancer dataset URL
url = ('https://raw.githubusercontent.com/health-data-science-OR/'
       'hpdm139-datasets/main/wisconsin.csv')

# read into dataframe
df = pd.read_csv(url, index_col='id')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 569 entries, 842302 to 92751
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Unnamed: 0               569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se       

In [4]:
df.shape

(569, 32)

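Now let's read in the same file, but this time it is compressed in `.zip` format. By default `pd.read_csv()` infers the compression from the file extension, so the call is unchanged.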

In [5]:
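# Wisconsin breast cancer dataset URL (zipped version)
url = 'https://raw.githubusercontent.com/health-data-science-OR/' \
      + 'hpdm139-datasets/main/wisconsin.zip'

# read into dataframe; the compression is inferred from the .zip extension
df = pd.read_csv(url, index_col='id')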
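The same behaviour can be checked offline with a small sketch. It builds a hypothetical zip archive containing a single CSV (when reading a zip, `pandas` expects the archive to hold only one data file) and reads it back twice: once relying on the default `compression='infer'`, and once naming the method explicitly. The toy data and the names `toy.zip`, `inferred` and `explicit` are assumptions made for illustration.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({'id': [1, 2], 'radius_mean': [17.99, 13.54]})

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, 'toy.zip')

    # Create a zip archive that contains exactly one CSV file
    with zipfile.ZipFile(zip_path, 'w') as zf:
        zf.writestr('toy.csv', toy.to_csv(index=False))

    # Default behaviour: compression is inferred from the .zip extension
    inferred = pd.read_csv(zip_path, index_col='id')

    # Equivalent call with the compression method stated explicitly
    explicit = pd.read_csv(zip_path, index_col='id', compression='zip')

print(inferred.equals(explicit))
```

Stating `compression='zip'` explicitly may be useful when a file name or URL does not end in a recognised extension.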

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 569 entries, 842302 to 92751
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Unnamed: 0               569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se       