# Python Data Analysis Notes: Importing & Exploring Datasets

## 🔧 Setup

In [2]:
import pandas as pd
import numpy as np

#### Common Data Format I/O Table

| Format   | Read Function                        | Save Function                     | Example File          |
|----------|--------------------------------------|-----------------------------------|------------------------|
| CSV      | `pd.read_csv()`                      | `df.to_csv()`                     | `data.csv`             |
| Excel    | `pd.read_excel()`                    | `df.to_excel()`                   | `data.xlsx`            |
| TSV      | `pd.read_csv(sep='\t')`              | `df.to_csv(sep='\t')`             | `data.tsv`             |
| JSON     | `pd.read_json()`                     | `df.to_json()`                    | `data.json`            |
| SQL      | `pd.read_sql()`                      | `df.to_sql()`                     | SQLite / SQLAlchemy    |
| Pickle   | `pd.read_pickle()`                   | `df.to_pickle()`                  | `data.pkl`             |
| Parquet  | `pd.read_parquet()`                  | `df.to_parquet()`                 | `data.parquet`         |
| Feather  | `pd.read_feather()`                  | `df.to_feather()`                 | `data.feather`         |
| NumPy    | `np.loadtxt()` / `np.genfromtxt()`   | `np.savetxt()`                    | `array.txt`            |
| ZIP/GZ   | `pd.read_csv(compression='zip')` or `compression='gzip'` | `df.to_csv(compression='zip')` or `compression='gzip'` | `data.zip`, `data.csv.gz` |
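
As a quick check of a few rows in the table, the sketch below writes a small DataFrame to CSV, TSV, gzip-compressed CSV, and Pickle, then reads each file back. The DataFrame contents and file names are made up for illustration, and a temporary directory is used so nothing is left on disk. Excel, Parquet, and Feather are skipped here because they typically need extra packages installed.

```python
import os
import tempfile

import pandas as pd

# Small illustrative DataFrame (values are made up for this sketch)
df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "data.csv")
    tsv_path = os.path.join(tmp, "data.tsv")
    gz_path = os.path.join(tmp, "data.csv.gz")
    pkl_path = os.path.join(tmp, "data.pkl")

    # Save: index=False keeps the row index out of the text files
    df.to_csv(csv_path, index=False)
    df.to_csv(tsv_path, sep="\t", index=False)
    df.to_csv(gz_path, index=False, compression="gzip")
    df.to_pickle(pkl_path)

    # Read each file back with the matching read function
    from_csv = pd.read_csv(csv_path)
    from_tsv = pd.read_csv(tsv_path, sep="\t")
    from_gz = pd.read_csv(gz_path, compression="gzip")
    from_pkl = pd.read_pickle(pkl_path)

# True when every round trip reproduced the original DataFrame
round_trip_ok = all(df.equals(x) for x in (from_csv, from_tsv, from_gz, from_pkl))
print(round_trip_ok)
```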

#### Options

**Example:**

```python
df = pd.read_csv('data/sample.csv',
                 sep=',',           # delimiter
                 header=0,          # row 0 (the first row) holds the column names
                 index_col=0,       # column 0 (the first column) becomes the index
                 na_values=['NA', '?'],  # extra strings to treat as NaN
                 dtype={'col1': int}     # set column types
                )
```

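> **header=None** means the file has no header row, so pandas numbers the columns 0, 1, 2, ...
> 
> **na_values** specifies additional strings to recognize as NaN, on top of the defaults such as `NA`, `N/A`, and `null`
> 
> **np.nan** is the value used to represent missing entries in the resulting DataFrame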
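
The options above can be tried without a file on disk. The sketch below feeds a small in-memory CSV to `pd.read_csv` through `io.StringIO`. The column names `id`, `col1`, and `col2` and all values are hypothetical, chosen only to exercise `header`, `index_col`, `na_values`, and `dtype`.

```python
import io

import pandas as pd

# Hypothetical CSV text: a header row, an id column, and two missing-value markers
raw = io.StringIO(
    "id,col1,col2\n"
    "a,1,x\n"
    "b,2,NA\n"
    "c,3,?\n"
)

df = pd.read_csv(
    raw,
    sep=",",                # delimiter
    header=0,               # first row holds the column names
    index_col=0,            # first column ("id") becomes the index
    na_values=["NA", "?"],  # both markers are read as NaN
    dtype={"col1": int},    # force col1 to an integer type
)

print(df)
print(df["col2"].isna().tolist())  # which rows of col2 were read as missing
```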

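The example below writes a small CSV file without a header row, then loads it with `?` treated as a missing value: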

In [None]:
# Create a small CSV file without a header row
import os

csv_file = 'data/data.csv'
os.makedirs('data', exist_ok=True)  # make sure the target folder exists
with open(csv_file, 'w') as f:
    f.write('Alice,30,New York\n')
    f.write('Bob,25,Los Angeles\n')
    f.write('Charlie,35,Chicago\n')
    f.write('David,40,?\n')
    f.write('Eva,28,Phoenix\n')

# Load the CSV file; '?' is not a default NaN marker, so it is listed explicitly
data = pd.read_csv(csv_file, header=None, na_values=['?'])
data.head()


|   | 0       | 1  | 2           |
|---|---------|----|-------------|
| 0 | Alice   | 30 | New York    |
| 1 | Bob     | 25 | Los Angeles |
| 2 | Charlie | 35 | Chicago     |
| 3 | David   | 40 | NaN         |
| 4 | Eva     | 28 | Phoenix     |


We can assign headers to the DataFrame using `df.columns`:

In [20]:
# define header names
data.columns = ['Name', 'Age', 'City']
data.head(2)

|   | Name  | Age | City        |
|---|-------|-----|-------------|
| 0 | Alice | 30  | New York    |
| 1 | Bob   | 25  | Los Angeles |
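
As an alternative to assigning `data.columns` after loading, the column names can be supplied while reading through the `names` parameter of `pd.read_csv`. The sketch below uses an in-memory copy of the first two rows so it runs on its own. The variable name `named` is just for illustration.

```python
import io

import pandas as pd

# In-memory copy of the first two rows of the headerless file
raw = io.StringIO("Alice,30,New York\nBob,25,Los Angeles\n")

# header=None: the file has no header row; names: labels to use instead
named = pd.read_csv(raw, header=None, names=["Name", "Age", "City"])

print(named.columns.tolist())
```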


In [19]:
data.info() # Display information about the DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    5 non-null      object
 1   Age     5 non-null      int64 
 2   City    4 non-null      object
dtypes: int64(1), object(2)
memory usage: 252.0+ bytes


`df.describe(percentiles=None, include=None, exclude=None)` returns summary statistics for the columns of a DataFrame.

> Options:
> 
> By default, it returns count, mean, std, min, 25%, 50%, 75%, and max for numeric columns only.
>
> **include:** `None` by default, which summarizes numeric columns only.
> 
> `include='all'` summarizes every column.
>
> You can also pass a list of data types (e.g., `include=['object', 'number', 'category']`).
> 
> **exclude:** Data types to leave out (e.g., `exclude=['object']`).
> 
> **percentiles:** Percentiles to report (default: `[0.25, 0.5, 0.75]`).
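
The options above can be compared side by side. The sketch below rebuilds the same five-row table in memory (the variable name `people` is only for illustration) and calls `describe` with the defaults, with custom percentiles, and with numeric columns excluded.

```python
import pandas as pd

# Same five rows as the example CSV, with the missing city as None
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [30, 25, 35, 40, 28],
    "City": ["New York", "Los Angeles", "Chicago", None, "Phoenix"],
})

# Defaults: numeric columns only
numeric_summary = people.describe()

# Custom percentiles: report the 10th and 90th percentiles
custom = people.describe(percentiles=[0.1, 0.9])

# Leave out numeric columns, so only the text columns are summarized
text_summary = people.describe(exclude=["number"])

print(numeric_summary)
print(custom)
print(text_summary)
```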

In [26]:
data.describe(include='all') # Get a statistical summary of the DataFrame

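|        | Name  | Age     | City     |
|--------|-------|---------|----------|
| count  | 5     | 5.0     | 4        |
| unique | 5     | NaN     | 4        |
| top    | Alice | NaN     | New York |
| freq   | 1     | NaN     | 1        |
| mean   | NaN   | 31.6    | NaN      |
| std    | NaN   | 5.94138 | NaN      |
| min    | NaN   | 25.0    | NaN      |
| 25%    | NaN   | 28.0    | NaN      |
| 50%    | NaN   | 30.0    | NaN      |
| 75%    | NaN   | 35.0    | NaN      |
| max    | NaN   | 40.0    | NaN      |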


Check missing values per column:

In [27]:
data.isnull().sum()

Name    0
Age     0
City    1
dtype: int64
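
Building on the count above, a possible next step is to look at the share of missing values per column and to pull out the affected rows. The sketch below rebuilds the same table in memory so it runs on its own. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# Same five rows as the example CSV, with the missing city as NaN
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [30, 25, 35, 40, 28],
    "City": ["New York", "Los Angeles", "Chicago", np.nan, "Phoenix"],
})

# Number of missing values per column
missing_count = people.isnull().sum()

# Fraction of missing values per column (mean of a True/False mask)
missing_share = people.isnull().mean()

# Rows that contain at least one missing value
rows_with_missing = people[people.isnull().any(axis=1)]

print(missing_count)
print(missing_share)
print(rows_with_missing)
```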

# Thanks for reading!

#### Written by @hellorito