## Data types in pandas

 > <b>DataFrame</b>: Two-dimensional, size-mutable, potentially heterogeneous tabular data. Contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
* Always two-dimensional (rows × columns), but may have many columns
* We usually call this a "dataset"
* Rows and columns are indexed; by default the indices are integers starting at 0
* Columns may have names (taken from the first line in the .csv file)
* Each column has a single data type (pandas will try to infer the best type according to the data)
* May be initialized with a dictionary ```<class 'dict'>```
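
As a minimal sketch of the points above, the following builds a DataFrame from a dictionary and inspects its shape, index, and inferred column types. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: each dictionary key becomes a column name.
data = {
    "name": ["Ann", "Bob", "Cid"],
    "age": [31, 25, 47],
    "height_m": [1.68, 1.80, 1.75],
}
df = pd.DataFrame(data)

print(df.shape)        # (rows, columns)
print(df.dtypes)       # one inferred dtype per column
print(list(df.index))  # default row labels start at 0
```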

<br>

> <b>Series</b>: One-dimensional ndarray with axis labels (including time series).
* Usually represents a column of a table
* May contain only 1 type of data
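
A short sketch of how a Series relates to a DataFrame column, using made-up data. Selecting one column returns a Series with a single dtype; when the values are mixed, pandas falls back to the generic `object` dtype rather than storing several types side by side.

```python
import pandas as pd

# Hypothetical table used only for illustration.
df = pd.DataFrame({"city": ["Oslo", "Rome"], "population_m": [0.7, 2.8]})

col = df["population_m"]   # selecting one column returns a Series
print(type(col).__name__)
print(col.dtype)

# Mixed values still share one dtype: the generic `object`.
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)
```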

## Reading data sources

 
* In order to work with the data, it must be represented in tabular form 
* * Sometimes our data is tabular – we just need to read it
* * In other cases, we need to create a tabular structure from the raw data
* Structured vs. Unstructured data 
* * Unstructured data: data that doesn't follow a predefined data model
* * Examples of unstructured data: images, plain text, audio, web pages
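
One possible way to create a tabular structure from raw data is sketched below. It assumes a hypothetical plain-text format (`date temperature=... city=...`) and uses a regular expression to turn each line into a record; real raw data will usually need its own parsing logic.

```python
import re
import pandas as pd

# Hypothetical raw text: free-form lines, no columns yet.
raw_lines = [
    "2021-03-01 temperature=21.5 city=Sofia",
    "2021-03-02 temperature=19.0 city=Varna",
]

# Named groups become the column names of the table.
pattern = re.compile(
    r"(?P<date>\S+) temperature=(?P<temperature>\S+) city=(?P<city>\S+)"
)
records = [pattern.match(line).groupdict() for line in raw_lines]

table = pd.DataFrame(records)
# Regex captures are strings, so convert the numeric column explicitly.
table["temperature"] = table["temperature"].astype(float)
print(table)
```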
 
#### Reading data from: 
* Tables in a text format such as .csv:  ```pd.read_csv("data.csv")``` 
* Spreadsheets (such as Excel, or Google Sheets exported to .xlsx): ```pd.read_excel()```; for general delimited text files (tab-separated by default) use ```pd.read_table()```
* Web services:  ``` pd.read_json()```
* SQL (requires a database driver, e.g. ```conda install pyodbc```):
 ```python
import pandas as pd
import pyodbc

# Connection details are elided; fill in SERVER and DATABASE for your setup
conn = pyodbc.connect("DRIVER={SQL Server};SERVER=...;DATABASE=...;Trusted_Connection=true")
customer_info = pd.read_sql("select * from Sales.Customer", conn)
```
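
The readers above can be tried without any files on disk. The sketch below feeds small in-memory strings (made-up data) to `pd.read_csv` and `pd.read_json` through `io.StringIO`; with real data you would pass a file path or URL instead.

```python
import io
import pandas as pd

# Hypothetical CSV content; the first line supplies the column names.
csv_text = "id,name,score\n1,Ann,9.5\n2,Bob,7.0\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Hypothetical JSON payload, shaped like a typical web-service response.
json_text = '[{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]'
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```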

#### Exploring a DataFrame: 
* Column names: ```dataframe.columns ```
* Column types: ```dataframe.dtypes```
* Index values: ```dataframe.index ```  
* Dimensions: ```dataframe.shape``` 
* * Format: ```(rows, columns)```  
* Show the first few rows: ```dataframe.head()```
* Return unique values of column: ```dataframe["column_name"].unique()``` 
* Group rows by a column, apply a function to each group, and combine the results: ```dataframe.groupby("column_name").mean(numeric_only=True)``` 
* Access a group of rows and columns by label(s): ```dataframe.loc[row_label]``` for a row, ```dataframe.loc[:, "column_name"]``` for a column 
* Access group of rows and columns by integer position(s): ```dataframe.iloc[i]```
* Access a single value for a row/column label pair: ```dataframe.at[row, "column_name"]```
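
The accessors listed above can be compared side by side on a small made-up table. This is only a sketch; the string row labels are chosen so that label-based (`loc`, `at`) and position-based (`iloc`) access are easy to tell apart.

```python
import pandas as pd

# Hypothetical data with explicit row labels "a".."d".
df = pd.DataFrame(
    {"species": ["cat", "dog", "cat", "dog"],
     "weight_kg": [4.0, 20.0, 5.0, 30.0]},
    index=["a", "b", "c", "d"],
)

print(df.columns.tolist())   # column names
print(df.shape)              # (rows, columns)

unique_species = df["species"].unique()
mean_weight = df.groupby("species")["weight_kg"].mean()

row_b = df.loc["b"]               # row selected by label
weights = df.loc[:, "weight_kg"]  # column selected by label
first_row = df.iloc[0]            # row selected by integer position
single = df.at["c", "weight_kg"]  # one value by row/column label pair
```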


#### Merging many data sources
* Automate the process as much as possible
* * From reading the raw data to getting the processed dataset
* * If the dataset changes or updates, you'll just re-run your code
* Document the process
* Create as few datasets as possible
* * I.e., merge many sources into one table if you can 
* Ensure the different sources are compatible and consistent
* * If they aren't, process the raw data 
* * Most common example: Mismatched IDs
* Make sure all column types are correct (```dataframe.dtypes```)
* * Example: a numeric column stored as str (shown as ```object``` in ```dataframe.dtypes```) 
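
A minimal sketch of the compatibility checks above, assuming two hypothetical sources where the ID and amount columns of one were read as strings. The types are fixed first, and only then are the sources merged into one table.

```python
import pandas as pd

# Hypothetical sources: IDs are integers in one table and strings in the other.
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ann", "Bob", "Cid"]})
orders = pd.DataFrame({"customer_id": ["1", "2", "2"],
                       "amount": ["10.5", "3.0", "7.25"]})

# Make the key columns compatible and fix the numeric column stored as str.
orders["customer_id"] = orders["customer_id"].astype("int64")
orders["amount"] = pd.to_numeric(orders["amount"])

# A left merge keeps every customer, including those without orders.
merged = customers.merge(orders, on="customer_id", how="left")
print(merged.dtypes)
print(merged)
```

Wrapping these steps in a function or script keeps the process repeatable: when a source is updated, re-running the code regenerates the merged table.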
  
 