## Data types in pandas

 > <b>DataFrame</b>: Two-dimensional, size-mutable, potentially heterogeneous tabular data. Contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
* Always two-dimensional, but may have many columns (features)
* We usually call this a "dataset"
* Rows and columns are indexed; numerical indices start at 0 by default
* Columns may have names (taken from the first line in the .csv file)
* Each column has a fixed data type (pandas will try to infer the best type according to the data)
* May be initialized with a dictionary ```<class 'dict'>```

<br>

> <b>Series</b>: One-dimensional ndarray with axis labels (including time series).
* Usually represents a column of a table
* Holds a single data type (dtype); mixed values are stored with the generic ```object``` dtype
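
A minimal sketch of the two structures described above. The column names and values are made up for illustration; the point is that a DataFrame column is a Series and that each Series carries one dtype.

```python
import pandas as pd

# A DataFrame built from a dictionary: keys become column names.
df = pd.DataFrame({"name": ["a", "b", "c"], "value": [1.5, 2.0, 3.5]})

# Selecting one column returns a Series with its own dtype.
col = df["value"]
print(type(col).__name__)
print(df.dtypes)

# Mixing types in one Series falls back to the generic object dtype.
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)
```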

## Reading data sources

 
* In order to work with the data, it must be represented in tabular form 
* * Sometimes our data is tabular – we just need to read it
* * In other cases, we need to create a tabular structure from the raw data
* Structured vs. Unstructured data 
* * Unstructured data: data that doesn't follow a predefined data model
* * Examples for unstructured data: Images, plain text, audio, web pages
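
As a rough illustration of creating a tabular structure from raw data, the sketch below parses a few plain-text lines into a DataFrame. The line format, field names, and values are hypothetical and chosen only to keep the example short.

```python
import pandas as pd

# Hypothetical raw text lines (invented for this example).
raw_lines = [
    "2021-01-03 sensor=A temp=21.5",
    "2021-01-04 sensor=B temp=19.0",
    "2021-01-05 sensor=A temp=22.1",
]

# Turn every line into a dictionary with one key per future column.
records = []
for line in raw_lines:
    date, sensor_part, temp_part = line.split()
    records.append({
        "date": date,
        "sensor": sensor_part.split("=")[1],
        "temp": float(temp_part.split("=")[1]),
    })

# A list of dictionaries maps directly onto rows of a DataFrame.
table = pd.DataFrame(records)
table["date"] = pd.to_datetime(table["date"])
print(table)
```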
 
#### Reading data from: 
* Tables in a text format such as .csv:  ```pd.read_csv("data.csv")``` (or ```pd.read_table()``` for tab-delimited text)
* Spreadsheets (such as Excel, or Google Sheets exported to .xlsx or .csv):  ```pd.read_excel()``` 
* Web services:  ``` pd.read_json()```
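* SQL: install a database driver (for example ```conda install pyodbc```), then pass a query and an open connection to ```pd.read_sql()```
```python
import pandas as pd
import pyodbc

# Fill in SERVER and DATABASE for your own SQL Server instance
conn = pyodbc.connect("DRIVER={SQL Server};SERVER=...;DATABASE=...;Trusted_Connection=true")
customer_info = pd.read_sql("select * from Sales.Customer", conn)
```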
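
The readers listed above can be tried without any external files. The sketch below is an assumption-laden stand-in: it uses in-memory strings instead of real files or web services, and the standard-library ```sqlite3``` module instead of SQL Server. The table name ```cities``` and all values are made up.

```python
import io
import sqlite3

import pandas as pd

# Comma-separated text, as it would appear in a .csv file.
csv_text = "id,city\n1,Sofia\n2,Plovdiv\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# Tab-delimited text: same reader, different separator.
tsv_text = "id\tcity\n1\tSofia\n2\tPlovdiv\n"
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# JSON as a list of records, similar to what a web service might return.
json_text = '[{"id": 1, "city": "Sofia"}, {"id": 2, "city": "Plovdiv"}]'
from_json = pd.read_json(io.StringIO(json_text))

# SQL: write the table to an in-memory SQLite database, then query it back.
conn = sqlite3.connect(":memory:")
from_csv.to_sql("cities", conn, index=False)
from_sql = pd.read_sql("select * from cities", conn)
conn.close()

print(from_sql)
```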

#### Exploring a DataFrame: 
* Column names: ```dataframe.columns ```
* Column types: ```dataframe.dtypes```
* Index values: ```dataframe.index ```  
* Dimensions: ```dataframe.shape``` 
* * Format: ```(rows, columns)```  
* Show the first few rows: ```dataframe.head()```
* Return the unique values of a column: ```dataframe["column_name"].unique()``` 
* Group the DataFrame by a column, apply a function, and combine the results: ```dataframe.groupby("column_name").mean()``` 
* Access a group of rows and columns by label(s): ```dataframe.loc[row_label, "column_name"]``` 
* Access group of rows and columns by integer position(s): ```dataframe.iloc[i]```
* Access a single value for a row/column label pair: ```dataframe.at[row, "column_name"]```
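
The accessors above can be compared side by side on a small table. This is a sketch with invented data; the row labels ```r1``` to ```r4``` are used only to make the difference between label-based and position-based access visible.

```python
import pandas as pd

df = pd.DataFrame(
    {"group": ["x", "y", "x", "y"], "score": [10, 20, 30, 40]},
    index=["r1", "r2", "r3", "r4"],
)

print(df.shape)              # (rows, columns)
print(df["group"].unique())  # distinct values in one column

row_by_label = df.loc["r2"]             # whole row, selected by its label
value_by_label = df.loc["r2", "score"]  # row label + column label
row_by_position = df.iloc[1]            # same row, selected by position
single = df.at["r3", "score"]           # fast access to one scalar

# Split by group, average the scores, combine into one Series.
means = df.groupby("group")["score"].mean()
print(means)
```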


#### Merging many data sources
* Automate the process as much as possible
* * From reading the raw data to getting the processed dataset
* * If the dataset changes or updates, you'll just re-run your code
* Document the process
* Create as few datasets as possible
* * I.e. merge many sources into one table if you can 
* Ensure the different sources are compatible and consistent
* * If they aren't, process the raw data 
* * Most common example: Mismatched IDs
* Make sure all column types are correct (```dataframe.dtypes```)
* * Example: str type for a numeric column 
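
A minimal sketch of the checklist above, assuming two hypothetical sources whose IDs and numeric columns were read with mismatched types. The table names, column names, and values are invented.

```python
import pandas as pd

# Source 1 stores IDs as zero-padded strings.
customers = pd.DataFrame(
    {"customer_id": ["001", "002", "003"], "name": ["Ann", "Bob", "Cy"]}
)
# Source 2 stores IDs as integers, but amounts were read as strings.
orders = pd.DataFrame(
    {"customer_id": [1, 2, 2], "amount": ["10.5", "20", "7.25"]}
)

# Make the sources compatible before merging.
customers["customer_id"] = customers["customer_id"].astype(int)
orders["amount"] = pd.to_numeric(orders["amount"])

# A left merge keeps every customer, including those without orders.
merged = customers.merge(orders, on="customer_id", how="left")
print(merged.dtypes)
print(merged)
```

Keeping these steps in one script means the processed dataset can be rebuilt by re-running the code whenever a source changes.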
  
 

## Examples
#### I

In [80]:
import pandas as pd
import numpy as np

A_and_B = pd.DataFrame({"A": [1, 2, 3, -1, -2, -3], "B": [3, 4, 5, -3, -4, -5]})
A_and_B

|   | A | B |
|---|---|---|
| 0 | 1 | 3 |
| 1 | 2 | 4 |
| 2 | 3 | 5 |
| 3 | -1 | -3 |
| 4 | -2 | -4 |
| 5 | -3 | -5 |


In [81]:
print(">> Type A_and_B:", type(A_and_B), "\n")
print(">> Type A_and_B[\"A\"]:", type(A_and_B["A"]), "\n")  
print(">> Shape A_and_B:", A_and_B.shape, "\n")
print(">> Shape A_and_B[\"A\"]:", A_and_B["A"].shape, "\n")

>> Type A_and_B: <class 'pandas.core.frame.DataFrame'> 

>> Type A_and_B["A"]: <class 'pandas.core.series.Series'> 

>> Shape A_and_B: (6, 2) 

>> Shape A_and_B["A"]: (6,) 



In [82]:
numbers = pd.Series([1, 2, 3, 4.3, 5.4])
numbers

0    1.0
1    2.0
2    3.0
3    4.3
4    5.4
dtype: float64

In [83]:
# Add new columns to the DataFrame A_and_B

A_and_B.insert(0, "Nums", numbers, True) # the last row will be NaN since the Series has length 5

A_and_B.insert(2, "C", np.arange(6), True)

A_and_B = A_and_B.assign(D=['A', 'B', 'C', 'D', 'E', 'F'])

A_and_B['E'] = np.zeros(6)

A_and_B.loc[:, "F"] = np.random.rand(6)

A_and_B['AB'] = A_and_B['A'] + A_and_B['B']

EC = A_and_B['E'] + A_and_B['C'] 
EF = A_and_B['E'] + A_and_B['F']
new_data = {'EC': EC, 'EF': EF }
A_and_B = A_and_B.assign(**new_data) 

A_and_B[A_and_B["A"] < 0]

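|   | Nums | A | C | B | D | E | F | AB | EC | EF |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 4.3 | -1 | 3 | -3 | D | 0.0 | 0.155973 | -4 | 3.0 | 0.155973 |
| 4 | 5.4 | -2 | 4 | -4 | E | 0.0 | 0.308686 | -6 | 4.0 | 0.308686 |
| 5 | NaN | -3 | 5 | -5 | F | 0.0 | 0.425555 | -8 | 5.0 | 0.425555 |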


#### II

In [20]:
accidents_data = pd.read_csv('data/2.1_accidents.csv')
accidents_data

|   | Miles from Home | % of Accidents |
|---|---|---|
| 0 | less than 1 | 23 |
| 1 | 1 to 5 | 29 |
| 2 | 6 to 10 | 17 |
| 3 | 11 to 15 | 8 |
| 4 | 16 to 20 | 6 |
| 5 | over 20 | 17 |


In [108]:
# Change column names
accidents_data.columns = ["miles","%"]
accidents_data

|   | miles | % |
|---|---|---|
| 0 | less than 1 | 23 |
| 1 | 1 to 5 | 29 |
| 2 | 6 to 10 | 17 |
| 3 | 11 to 15 | 8 |
| 4 | 16 to 20 | 6 |
| 5 | over 20 | 17 |


In [109]:
print(">> accidents_data[\"miles\"]: \n", accidents_data["miles"], "\n")
print(">> accidents_data[\"miles\"][0]: \n", accidents_data["miles"][0], "\n") 
print(">> accidents_data.loc[1][:\"%\"]: \n", accidents_data.loc[1][:"%"], "\n")
print(">> accidents_data.loc[1,[\"miles\",\"%\"]]: \n", accidents_data.loc[1,["miles","%"]], "\n")
print(">> accidents_data.iloc[1, :2]: \n", accidents_data.iloc[1, :2], "\n")  
print(">> accidents_data.dtypes: \n", accidents_data.dtypes, "\n") 
print(">> accidents_data.columns: \n", accidents_data.columns, "\n") 
print(">> accidents_data.index: \n", accidents_data.index, "\n") 

>> accidents_data["miles"]: 
 0    less than 1
1         1 to 5
2        6 to 10
3       11 to 15
4       16 to 20
5        over 20
Name: miles, dtype: object 

>> accidents_data["miles"][0]: 
 less than 1 

>> accidents_data.loc[1][:"%"]: 
 miles    1 to 5
%            29
Name: 1, dtype: object 

>> accidents_data.loc[1,["miles","%"]]: 
 miles    1 to 5
%            29
Name: 1, dtype: object 

>> accidents_data.iloc[1, :2]: 
 miles    1 to 5
%            29
Name: 1, dtype: object 

>> accidents_data.dtypes: 
 miles    object
%         int64
dtype: object 

>> accidents_data.columns: 
 Index(['miles', '%'], dtype='object') 

>> accidents_data.index: 
 RangeIndex(start=0, stop=6, step=1) 



#### III

In [40]:
car_data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data", header = None)
car_data

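|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1723 | low | low | 5more | more | med | med | good |
| 1724 | low | low | 5more | more | med | high | vgood |
| 1725 | low | low | 5more | more | big | low | unacc |
| 1726 | low | low | 5more | more | big | med | good |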


In [41]:
# Add column names
car_data.columns = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "acceptability"]
car_data

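|   | buying | maint | doors | persons | lug_boot | safety | acceptability |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1723 | low | low | 5more | more | med | med | good |
| 1724 | low | low | 5more | more | med | high | vgood |
| 1725 | low | low | 5more | more | big | low | unacc |
| 1726 | low | low | 5more | more | big | med | good |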


In [51]:
print(">> # Checking the unique values of a variable as evidence that it is categorical rather than numeric.")
print(">> car_data[\"doors\"].unique(): \n", car_data["doors"].unique(), "\n") 
print(">> car_data[\"buying\"].unique(): \n", car_data["buying"].unique(), "\n")  
print(">> car_data.groupby(\"buying\").size(): \n", car_data.groupby("buying").size(), "\n")
print(">> car_data.groupby(\"persons\").size(): \n", car_data.groupby("persons").size(), "\n")  

>> # Checking the unique values of a variable as evidence that it is categorical rather than numeric.
>> car_data["doors"].unique(): 
 ['2' '3' '4' '5more'] 

>> car_data["buying"].unique(): 
 ['vhigh' 'high' 'med' 'low'] 

>> car_data.groupby("buying").size(): 
 buying
high     432
low      432
med      432
vhigh    432
dtype: int64 

>> car_data.groupby("persons").size(): 
 persons
2       576
4       576
more    576
dtype: int64 
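
Once a column is known to be categorical, one possible follow-up is to store it with a categorical dtype. The sketch below uses a small made-up sample with the same labels as ```buying```; the ordering low < med < high < vhigh is an assumption about what the labels mean, not something stated by the dataset output above.

```python
import pandas as pd

# Small invented sample that reuses the labels seen in the "buying" column.
sample = pd.DataFrame({"buying": ["vhigh", "low", "med", "high", "low"]})

# Assumed ordering of the price levels, from cheapest to most expensive.
levels = ["low", "med", "high", "vhigh"]
sample["buying"] = pd.Categorical(sample["buying"], categories=levels, ordered=True)

print(sample["buying"].dtype)

# With an ordered categorical, sorting and comparisons follow the level order
# instead of alphabetical order.
sorted_levels = sample.sort_values("buying")["buying"].tolist()
expensive = sample[sample["buying"] >= "high"]
print(sorted_levels)
print(expensive)
```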



#### IV

In [75]:
books = pd.read_json("https://openlibrary.org/api/books?bibkeys=ISBN:9780345354907,ISBN:0881847690,LCCN:2005041555&format=json")
books

|   | ISBN:9780345354907 | ISBN:0881847690 | LCCN:2005041555 |
|---|---|---|---|
| bib_key | ISBN:9780345354907 | ISBN:0881847690 | LCCN:2005041555 |
| info_url | https://openlibrary.org/books/OL9831606M/The_C... | https://openlibrary.org/books/OL22232644M/Watc... | https://openlibrary.org/books/OL3421202M/At_th... |
| preview | restricted | restricted | restricted |
| preview_url | https://archive.org/details/caseofcharlesdex00... | https://archive.org/details/watchersoutoftim00... | https://archive.org/details/atmountainsofmad00... |
| thumbnail_url | https://covers.openlibrary.org/b/id/207586-S.jpg | https://covers.openlibrary.org/b/id/9871313-S.jpg | https://covers.openlibrary.org/b/id/8259841-S.jpg |


In [76]:
books.loc['preview_url', 'ISBN:9780345354907']

'https://archive.org/details/caseofcharlesdex00hplo'
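
In the result above, each top-level JSON key became a column, so every book is a column rather than a row. A hedged sketch of reading such a structure with one record per row is shown here; it uses a small inline JSON string with hypothetical keys and values instead of the live API.

```python
import io

import pandas as pd

# Hypothetical JSON shaped like {key: {field: value, ...}, ...}.
json_text = (
    '{"book_a": {"title": "T1", "pages": 100},'
    ' "book_b": {"title": "T2", "pages": 250}}'
)

# Default orientation: top-level keys become columns.
by_column = pd.read_json(io.StringIO(json_text))

# orient="index": top-level keys become row labels, one record per row.
by_row = pd.read_json(io.StringIO(json_text), orient="index")

print(by_column)
print(by_row)
print(by_row.loc["book_b", "pages"])
```

An alternative with the same effect on an already loaded frame is to transpose it with ```.T```.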