# Intro to Pandas

In [1]:
import pandas as pd

## Series

A Series is a one-dimensional labeled array that can hold any data type, such as integers, floats, and strings.
It is similar to a column in a spreadsheet or a SQL table.
The cell below shows three ways to create a Series (from a list, a dictionary, and a scalar), but only the list version is run; the other two are commented out.

In [2]:
cars = pd.Series(["BMW", "Toyota", "Honda"]) # List
# series = pd.Series({'a': 'BMW', 'b': 'Toyota', 'c': 'Honda'}) # Dictionary
# series = pd.Series(5, index=[0, 1, 2, 3, 4, 5]) # Scalar
cars

0       BMW
1    Toyota
2     Honda
dtype: object

In [3]:
colors = pd.Series(["Red", "Blue", "White"])
colors

0      Red
1     Blue
2    White
dtype: object
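
The dictionary and scalar forms that are commented out in the earlier cell behave a little differently from the list form. The following is a minimal sketch of both; the variable names `cars_by_label` and `fives` are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Dictionary keys become the index labels.
cars_by_label = pd.Series({"a": "BMW", "b": "Toyota", "c": "Honda"})

# A scalar is repeated once for every index label supplied.
fives = pd.Series(5, index=[0, 1, 2])

print(cars_by_label["b"])          # look up a value by its label
print(list(cars_by_label.index))   # the labels come from the dictionary keys
print(fives.tolist())              # the scalar repeated for each label
```

With a list, pandas assigns the default integer labels 0, 1, 2, which is why the `cars` and `colors` outputs above start at 0.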

## DataFrames

A DataFrame is a 2-dimensional, labeled data structure that, like a Series, can hold any data type.
It is size-mutable, meaning that columns can be inserted into and deleted from the DataFrame.
It can also be heterogeneous, i.e. each column can hold a different data type, and its structure consists of two labeled axes:
- Row (axis=0), Column (axis=1)

Additionally, it can support missing data, which is marked as NaN.

In [4]:
car_data = pd.DataFrame({"Car make": cars, "Color": colors})
car_data

   Car make  Color
0       BMW    Red
1    Toyota   Blue
2     Honda  White
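
To see the missing-data and size-mutability points in action, here is a small sketch that builds a DataFrame from two Series of different lengths. The names `makes`, `paint`, and `partial`, and the temporary `Doors` column, are assumptions chosen for this illustration.

```python
import pandas as pd

makes = pd.Series(["BMW", "Toyota", "Honda"])
paint = pd.Series(["Red", "Blue"])  # one value short on purpose

# Series are aligned on their index; label 2 has no colour, so it becomes NaN.
partial = pd.DataFrame({"Car make": makes, "Color": paint})
print(partial)
print(partial["Color"].isna().sum())      # number of missing values in the column

# Size-mutable: insert a column, then delete it again.
partial["Doors"] = [4, 4, 4]
partial = partial.drop("Doors", axis=1)   # axis=1 refers to columns
print(partial.shape)                      # (rows, columns)
```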


### Import data from CSV file

The example below imports a CSV file containing car sales data from a local path.
You can also pass a URL to read_csv(). Here is what that would look like:

```
car_sales_tracker_df = pd.read_csv("https://raw.githubusercontent.com/miguel-wgu/SimpleMLProject/main/data/car_sales_tracker.csv")
```
The URL must point to the raw file contents rather than a GitHub web page, which is why this one uses raw.githubusercontent.com. It comes directly from my GitHub repo.

In [5]:
car_sales_tracker_df = pd.read_csv("data/car_sales_tracker.csv")
car_sales_tracker_df

     Make  Colour  Odometer (KM)  Doors       Price
0  Toyota   White         150043      4   $4,000.00
1   Honda     Red          87899      4   $5,000.00
2  Toyota    Blue          32549      3   $7,000.00
3     BMW   Black          11179      5  $22,000.00
4  Nissan   White         213095      4   $3,500.00
5  Toyota   Green          99213      4   $4,500.00
6   Honda    Blue          45698      4   $7,500.00
7   Honda    Blue          54738      4   $7,000.00
8  Toyota   White          60000      4   $6,250.00
9  Nissan   White          31600      4   $9,700.00
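
If you want to experiment with read_csv() without the file or a network connection, you can feed it CSV text held in memory. This is only a sketch: the two rows are copied from the table above, and the exact layout of the real file (for example, the quoting of the Price column) is an assumption.

```python
import io
import pandas as pd

# io.StringIO stands in for a file on disk; the quoted prices are an assumed layout.
csv_text = '''Make,Colour,Odometer (KM),Doors,Price
Toyota,White,150043,4,"$4,000.00"
Honda,Red,87899,4,"$5,000.00"
'''

sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df.shape)               # (rows, columns)
print(sample_df.columns.tolist())    # header row becomes the column names
print(sample_df.loc[0, "Price"])     # quoted field keeps its comma
```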


### Export data to CSV file

The most commonly used parameters of the to_csv() method are:
- path_or_buf - the save location, given as a string file path
- index - whether to write the row labels (index)
    - Default is True, which writes the row labels as the first column

Other parameters include columns (which columns to write), sep (the separator), and date_format.

In [6]:
car_sales_tracker_df.to_csv("data/exported_car_sales_tracker.csv", index=False)
# The .to_excel() method exports to Excel instead (requires an Excel writer such as openpyxl)
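
The effect of the index parameter is easiest to see by writing a DataFrame both ways and reading the result back. The sketch below uses a small made-up DataFrame and keeps the CSV text in memory rather than writing files; `df`, `with_index`, and `without_index` are illustrative names.

```python
import io
import pandas as pd

df = pd.DataFrame({"Make": ["Toyota", "Honda"], "Doors": [4, 4]})

# With no path, to_csv() returns the CSV text instead of writing a file.
with_index = df.to_csv()                 # index=True is the default
without_index = df.to_csv(index=False)

# Reading the first version back turns the saved index into an extra column.
print(pd.read_csv(io.StringIO(with_index)).columns.tolist())
print(pd.read_csv(io.StringIO(without_index)).columns.tolist())
```

The first list contains an extra `Unnamed: 0` column, which is why index=False is usually the better choice when the index is just the default row numbers.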

## Describing data

Here we will be using the dtypes attribute.
An attribute is data stored on the object, and it is accessed using a dot (.) followed by the attribute name.
A function (or method), in contrast, is an action performed on the object. It is accessed using a dot (.) followed by the function name and parentheses ().
```
# Function (or method)
car_sales_tracker_df.to_csv()

# Attribute
car_sales_tracker_df.dtypes
```

In [7]:
car_sales_tracker_df.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price            object
dtype: object
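
One quick way to check whether a name is a method or an attribute is Python's built-in callable(). The sketch below uses a small made-up DataFrame named `df`.

```python
import pandas as pd

df = pd.DataFrame({"Make": ["Toyota", "Honda"], "Doors": [4, 4]})

print(callable(df.to_csv))   # True: a method, so call it with ()
print(callable(df.dtypes))   # False: an attribute, already a value
print(df.shape)              # another attribute: (rows, columns)
```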

### Columns and Index

The columns attribute returns the column names of the DataFrame, while the index attribute returns the row labels, which by default are a range beginning at 0.
The example below stores the column names in the variable car_attributes_columns. Note that this is a pandas Index object rather than a plain list; call .tolist() on it if you need a list.

In [11]:
car_attributes_columns = car_sales_tracker_df.columns
car_attributes_columns

Index(['Make', 'Colour', 'Odometer (KM)', 'Doors', 'Price'], dtype='object')

In [10]:
car_sales_tracker_df.index

RangeIndex(start=0, stop=10, step=1)
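
Both attributes return pandas index objects, which can be converted to ordinary Python lists when needed. Here is a minimal sketch with a made-up three-row DataFrame; `column_names` and `row_labels` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"Make": ["Toyota", "Honda", "BMW"], "Doors": [4, 4, 5]})

column_names = df.columns.tolist()   # Index -> plain Python list
row_labels = list(df.index)          # RangeIndex -> list of row labels

print(column_names)
print(row_labels)
print(len(df.index))                 # number of rows
```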

### describe() Function

The describe() function returns a statistical summary of the DataFrame.
By default it summarizes only numeric columns, so the example below returns statistics for the Odometer (KM) and Doors columns only.
If you view the dtypes attribute, you will see that these are the only columns with a numeric dtype (int64). The rest are objects, including Price, whose values are stored as strings such as "$4,000.00".

In [13]:
car_sales_tracker_df.describe()

       Odometer (KM)     Doors
count           10.0      10.0
mean         78601.4       4.0
std     61983.471735  0.471405
min          11179.0       3.0
25%         35836.25       4.0
50%          57369.0       4.0
75%          96384.5       4.0
max         213095.0       5.0
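
Price is left out of the summary because its values are text. One possible way to include it, sketched here on a made-up three-row DataFrame, is to strip the currency formatting and convert the column to a float; the new column name `Price (numeric)` is an assumption for this example. Passing include="all" is another option that also summarizes the text columns.

```python
import pandas as pd

df = pd.DataFrame({
    "Make": ["Toyota", "Honda", "BMW"],
    "Price": ["$4,000.00", "$5,000.00", "$22,000.00"],
})

# Strip the dollar sign and thousands separators, then convert to float.
df["Price (numeric)"] = (
    df["Price"].str.replace(r"[$,]", "", regex=True).astype(float)
)

print(df.describe())                # now includes the numeric price column
print(df.describe(include="all"))   # also summarizes the text columns
```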


### info() Function

The info() function prints a concise summary of the DataFrame: the index range, column names, non-null counts, data types, and memory usage.
It is essentially a combination of the dtypes, columns, and index attributes.

In [14]:
car_sales_tracker_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Make           10 non-null     object
 1   Colour         10 non-null     object
 2   Odometer (KM)  10 non-null     int64 
 3   Doors          10 non-null     int64 
 4   Price          10 non-null     object
dtypes: int64(2), object(3)
memory usage: 528.0+ bytes
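
Because info() prints its summary instead of returning it, assigning the result to a variable gives None. If you need the text itself, one option is to pass a buffer through the buf parameter, as in this sketch with a made-up two-row DataFrame.

```python
import io
import pandas as pd

df = pd.DataFrame({"Make": ["Toyota", "Honda"], "Doors": [4, 4]})

result = df.info()   # prints the summary
print(result)        # None: nothing is returned

# Pass a text buffer to capture the summary as a string instead.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
print("Doors" in summary)   # the captured text lists each column
```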
