## Pandas Data Types
 - Series
    - A Series is a single column of values
    - Each value can be given an index label (a row name)
    - Set the index when creating a Series by passing index=['x', 'y', 'z']
    - Set s.index = ['x', 'y', 'z'] if the Series already exists
    - A value can be accessed by its integer position (s.iloc[0]) or, if index labels are defined, by its label (s['x'])
    - Useful for extracting data from a DataFrame or merging data into one
 - DataFrame
    - df.head(2) to see the first two rows as a sample
    - df.columns to list all columns
    - Set the index when creating a DataFrame: pd.DataFrame(data, index=['x', 'y', 'z'])
    - pd.read_csv('file path') to read data from a CSV file
    - df.to_csv('file path') to write data to a CSV file
    - df.shape to get the number of rows and columns as a tuple
    - Indexing: iloc uses integer positions, loc uses index labels
    - Select a single column or multiple columns
    - describe, groupby and size summarize a column or groups of rows
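
The Series bullets above can be sketched on a tiny example. The data and the column names `pop_2021` and `pop_2022` are made up for illustration; this is a minimal sketch, not the only way to do it.

```python
import pandas as pd

# Hypothetical data: populations in millions, labeled by state code
s = pd.Series([38, 26, 19], index=['ca', 'tx', 'fl'])

by_label = s['tx']        # access by index label
by_position = s.iloc[1]   # access by integer position

# Extract a column from a DataFrame as a Series
df = pd.DataFrame({'pop_2021': [36, 24, 16]}, index=['ca', 'tx', 'fl'])
col = df['pop_2021']

# Add a Series to the DataFrame as a new column; rows are matched on index labels
df['pop_2022'] = s
```

Using `iloc` for positional access keeps the intent explicit when the index has labels.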


In [1]:

import pandas as pd 
import numpy as np

In [None]:
# A simple series
s = pd.Series([1,2,3,4])

s 

In [2]:
# Create a series holding population values of ca, tx, fl, ny

s = pd.Series([38, 26, 19, 19])

s


0    38
1    26
2    19
3    19
dtype: int64

In [None]:
# Label the index with state codes so the Series is easier to read
# Values can then be read and updated by index label

s = pd.Series([38,26,19,19], index=['ca', 'tx', 'fl', 'ny'])

s



In [8]:
# Replace the index on an existing Series (above, the index was set at creation time)
# The labels keep the ca, tx, fl, ny order so each value stays with its state
s.index = ['ca', 'tx', 'fl', 'ny']

s

ca    38
tx    26
fl    19
ny    19
dtype: int64

In [9]:
# Extract the values of a Series

s.values


array([38, 26, 19, 19], dtype=int64)

In [None]:
# Extract index of a series

s.index

In [None]:
# Compare every value to a threshold; returns True for states with a population above 20 million

s > 20
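
The boolean result of a comparison like the one above can also be used to filter the Series. A minimal sketch, reusing the same sample values; the names `mask`, `large` and `n_large` are illustrative.

```python
import pandas as pd

s = pd.Series([38, 26, 19, 19], index=['ca', 'tx', 'fl', 'ny'])

mask = s > 20               # boolean Series, one True/False per state
large = s[mask]             # keeps only the entries where mask is True
n_large = int(mask.sum())   # True counts as 1, so the sum is the number of matches
```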

### DataFrame
 - Create a DataFrame with sample population figures for each state over several years

In [2]:
df = {'ca':[35, 36, 38], 'tx': [23, 24, 26], 'fl' : [14, 16, 19], 'ny' : [15, 16, 19]}

pd.DataFrame(df) 

   ca  tx  fl  ny
0  35  23  14  15
1  36  24  16  16
2  38  26  19  19


In [None]:
# In the example above, the default index is 0, 1, 2
# Replace it with years: 2020, 2021, 2022

df = {'ca':[35, 36, 38], 'tx': [23, 24, 26], 'fl' : [14, 16, 19], 'ny' : [15, 16, 19]}

pd.DataFrame(df, index = [2020, 2021, 2022 ])
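
Once the years are the index, rows can be looked up by year. A small sketch using the same sample figures; assigning the result to a variable named `pop` is an assumption made here so the DataFrame can be reused.

```python
import pandas as pd

data = {'ca': [35, 36, 38], 'tx': [23, 24, 26], 'fl': [14, 16, 19], 'ny': [15, 16, 19]}
pop = pd.DataFrame(data, index=[2020, 2021, 2022])

row_2021 = pop.loc[2021]        # all states for the year 2021, as a Series
ca_2022 = pop.loc[2022, 'ca']   # a single value: row label, then column label
first_row = pop.iloc[0]         # first row by position, which is the year 2020
```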



### Read data from a csv file
 - The sample file was downloaded from Kaggle
 - Read the file from the data directory
 - Check the number of rows and columns with the shape attribute
 - Use the head method to see only the first few rows
 - Use the columns attribute to see which columns are present
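
The steps above can be tried without the Kaggle file. The sketch below reads a few made-up rows from an in-memory string (via `io.StringIO`) instead of a path; the column names and values are illustrative only.

```python
import io
import pandas as pd

# Hypothetical CSV text whose first column has no header, like a saved row index
csv_text = ",country,points,price\n0,Italy,87,\n1,Portugal,87,15.0\n2,US,90,14.0\n"

raw = pd.read_csv(io.StringIO(csv_text))                   # unnamed column kept as data
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)  # unnamed column used as index

shape = indexed.shape            # (rows, columns)
sample = indexed.head(2)         # first two rows
columns = list(indexed.columns)  # column names
```

Without `index_col=0`, the headerless column shows up as a regular column named `Unnamed: 0`.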

In [None]:
pd.read_csv("data/winemag-data-130k-v2.csv", nrows=10)

In [None]:
# The file already contains an unnamed first column of row numbers (pandas reads it as "Unnamed: 0"). Pass index_col=0 to use that column as the index

wine_review = pd.read_csv("data/winemag-data-130k-v2.csv", nrows=10, index_col=0)

wine_review

In [None]:
# Find the shape as (rows, columns). shape is an attribute, not a method, so it takes no "()"
wine_review = pd.read_csv("data/winemag-data-130k-v2.csv")

wine_review.shape


In [None]:
# Inspect data using head method

wine_review.head(2)

In [None]:
# Print all columns

wine_review.columns

### Indexing, Selecting and Assigning
- This section follows material from the Kaggle site
- First, access columns of the DataFrame using attribute (dot) access and [] brackets
- Then index with iloc and loc, and note the key differences between them


In [None]:
reviews = pd.read_csv("data/winemag-data-130k-v2.csv", index_col=0)

pd.set_option('display.max_rows', 5)

reviews

In [None]:
# Access a column using the dot property and dictionary-style []

reviews.country

In [None]:
reviews['country']

In [None]:
# Select the country column, then the value at index label 0

reviews['country'][0]

In [None]:
# Access multiple columns using []

reviews[['country', 'region_1']] 
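
A short sketch of the difference between dot and bracket access, on a made-up DataFrame. The column name `region 1` (with a space) is hypothetical and chosen to show a case where only brackets work.

```python
import pandas as pd

df = pd.DataFrame({'country': ['Italy', 'US'], 'region 1': ['Etna', 'Napa']})

a = df.country       # dot access: only for names that are valid Python identifiers
b = df['country']    # bracket access: works for any column name
c = df['region 1']   # a name with a space can only be reached with brackets

sub = df[['country', 'region 1']]  # a list of names returns a DataFrame, not a Series
```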

### Indexing in Pandas
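 - iloc and loc
 - iloc selects by integer position; loc selects by index label
 - Both are row first, column second, unlike bracket access such as reviews['country'][0], which is column first, row second
 - An iloc slice excludes its end position (0:4 gives 4 rows); a loc slice includes its end label (0:4 gives 5 rows)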
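
The difference between iloc and loc is easiest to see on a small DataFrame. This is a minimal sketch with made-up rows; the index labels 10 to 14 are chosen only so that labels and positions differ.

```python
import pandas as pd

df = pd.DataFrame({'country': ['Italy', 'Portugal', 'US', 'US', 'France'],
                   'points': [87, 87, 90, 91, 88]})

by_pos = df.iloc[0:3]    # positions 0, 1, 2: the end position is excluded
by_label = df.loc[0:3]   # labels 0, 1, 2, 3: the end label is included

# With a non-default index, positions and labels no longer coincide
shifted = df.copy()
shifted.index = [10, 11, 12, 13, 14]
first_by_pos = shifted.iloc[0]      # first row by position
first_by_label = shifted.loc[10]    # same row, addressed by its label
```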
 

In [None]:
# displays 1st row of the dataframe
reviews.iloc[0]

In [None]:
# displays 1st column of the dataframe

reviews.iloc[:,0]

In [None]:
# A bare : selects everything, which is why all rows were returned above. A slice such as :2 limits the rows

reviews.iloc[:2,0]

In [None]:
# Define a range of rows as well as columns to be returned
reviews.iloc[0:4, [0,1,2]]

In [None]:
# Some examples using loc
# loc slices include the end label, so 0:4 returns the rows labeled 0 through 4 (5 rows)

reviews.loc[0:4,'country']

In [None]:
# Find all columns

reviews.columns

In [None]:
# Select the taster_name, taster_twitter_handle and points columns

reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]

In [None]:
# Use set_index to make a more meaningful column the index; it returns a new DataFrame and leaves reviews unchanged

reviews.set_index('title')
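
A minimal sketch of what set_index does, on a made-up DataFrame with hypothetical titles. It shows that the original is not modified and that the new index can be used with loc.

```python
import pandas as pd

df = pd.DataFrame({'title': ['Wine A', 'Wine B'], 'points': [87, 90]})

by_title = df.set_index('title')   # returns a new DataFrame; df keeps its 0, 1 index

# The title column has moved into the index, so rows can be looked up by title
points_b = by_title.loc['Wine B', 'points']
```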

In [None]:
# Now look at conditional selection
# Select all Italian wines
# Select all Italian wines that ALSO scored 90 or above
# Select all wines that are Italian OR scored 90 or above

reviews.country == 'Italy'

In [None]:
# The comparison above produces a boolean Series; passing it to loc keeps only the rows where it is True
reviews.loc[reviews.country == 'Italy']

In [None]:
# Now add points conditions as AND

reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]

In [None]:
# OR condition: Italy OR 90 points and above

reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]
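
The AND and OR selections can be checked on a handful of made-up rows. A sketch under those assumptions; the parentheses around each comparison matter because `&` and `|` bind more tightly than `==` and `>=`.

```python
import pandas as pd

df = pd.DataFrame({'country': ['Italy', 'Italy', 'US', 'France'],
                   'points': [92, 85, 95, 88]})

is_italy = df.country == 'Italy'   # boolean Series
high = df.points >= 90             # boolean Series

both = df.loc[is_italy & high]     # rows where both conditions hold
either = df.loc[is_italy | high]   # rows where at least one condition holds
```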

In [None]:
# Some rows have no value in a given column; filter them with isnull and notnull

reviews.loc[reviews.price.notnull()]

In [None]:
# Select the rows where price is missing

reviews.loc[reviews.price.isnull()]
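
A small sketch of isnull and notnull on made-up data with one missing price. Summing the boolean result is a quick way to count missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'title': ['A', 'B', 'C'], 'price': [15.0, np.nan, 14.0]})

priced = df.loc[df.price.notnull()]       # rows that have a price
unpriced = df.loc[df.price.isnull()]      # rows where the price is missing
n_missing = int(df.price.isnull().sum())  # number of missing prices
```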

In [None]:
# rename returns a new DataFrame, so assign the result; reviews itself is unchanged
renamed = reviews.rename(columns={'region_1': 'region', 'region_2': 'locale'})

renamed

In [None]:
# The exercise asks to set the index name to "wine". The solution renames the row axis rather than moving a column into the index, which is not obvious from the question

# reviews.set_index('winery')

In [None]:
# Name the row axis "wine"

# reviews.rename_axis("wine", axis="rows")
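
The difference between renaming columns and naming the row axis can be sketched on a one-row, made-up DataFrame. Neither call modifies the original; both return a new DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'region_1': ['Etna'], 'region_2': ['Sicily']})

# rename changes column labels
renamed = df.rename(columns={'region_1': 'region', 'region_2': 'locale'})

# rename_axis gives the index itself a name; the row labels stay the same
named = df.rename_axis('wine', axis='index')
```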

## Summarizing values and mapping functions in the data frame

In [None]:
reviews.head()

In [None]:
reviews.describe()

In [None]:
# Describe just points

reviews.points.describe()

In [None]:
reviews.price.describe()

In [None]:
reviews.points.mean()

In [None]:
# Unique values

reviews.taster_name.unique()
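
The summary calls above can be checked by hand on a few made-up rows. A minimal sketch; the column name `taster` and its values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'taster': ['Ann', 'Bob', 'Ann', 'Cy'],
                   'points': [80, 90, 100, 90]})

stats = df.points.describe()    # count, mean, std, min, quartiles, max
mean_points = df.points.mean()  # arithmetic mean of the column
tasters = df.taster.unique()    # distinct values, in order of first appearance
```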

In [None]:
# Groupby function
# Group by points, then count the rows in each group
# Group by points, then show the minimum price
# Group by country, then aggregate price

reviews.groupby('points').points.count()

In [None]:
reviews.groupby('points').price.min()

In [None]:
# Pass 'min' and 'max' by name; newer pandas versions warn when given the builtins
reviews.groupby(['country']).price.agg([len, 'min', 'max'])
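
The groupby calls can be verified on a small made-up DataFrame. The sketch below also puts `'size'` and `'count'` side by side, since they differ when a group contains a missing price: `'size'` counts every row, while `'count'` skips missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'country': ['Italy', 'Italy', 'US', 'US', 'US'],
                   'points': [87, 90, 90, 91, 87],
                   'price': [20.0, np.nan, 14.0, 30.0, 25.0]})

per_points = df.groupby('points').points.count()   # rows per score
min_price = df.groupby('points').price.min()       # cheapest wine per score

# Several aggregations at once, one output column per aggregation
summary = df.groupby('country').price.agg(['size', 'count', 'min', 'max'])
```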