# Pandas Intro

Pandas Series: https://pandas.pydata.org/pandas-docs/stable/reference/series.html

Pandas DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html

Pandas Arrays: https://pandas.pydata.org/pandas-docs/stable/reference/arrays.html



Series
- A pandas Series is a one-dimensional labeled array capable of holding data of any type.
- The axis labels are collectively called the index. A Series is comparable to a single column in an Excel sheet.


DataFrame
- A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.
- A pandas DataFrame consists of three principal components: the data, the rows, and the columns.

Arrays
- For most data types, pandas uses NumPy arrays as the concrete objects contained within an Index, Series, or DataFrame.
- For some data types, pandas extends NumPy’s type system. String aliases for these types can be found at dtypes.
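
As a minimal sketch of the three building blocks described above (the country figures and variable names are made up for illustration), the following creates a Series, a DataFrame built on it, and a nullable-integer extension array:

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels.
population = pd.Series([83, 126, 331], index=["Germany", "Japan", "USA"])

# A DataFrame: columns that share one index.
countries = pd.DataFrame(
    {"population_m": population, "continent": ["Europe", "Asia", "America"]}
)

# A pandas extension array: the nullable "Int64" dtype keeps integers
# as integers even when a value is missing.
nullable = pd.array([1, 2, None], dtype="Int64")

print(countries.shape)          # (rows, columns)
print(population.to_numpy())    # the NumPy array backing the Series
print(nullable.dtype)
```

The string alias "Int64" (capital I) is one of the dtype aliases mentioned above; the lowercase "int64" is the plain NumPy type.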

# Pandas Functions List

## Input/Output Functions

##### Read

In [None]:
read_csv(filepath_or_buffer[, sep, …])          Read a comma-separated values (csv) file into DataFrame.
read_excel(io[, sheet_name, header, names, …])  Read an Excel file into a pandas DataFrame.
ExcelFile.parse(self[, sheet_name, header, …])  Parse specified sheet(s) into a DataFrame.
read_html(io[, match, flavor, header, …])       Read HTML tables into a list of DataFrame objects.
read_gbq(query, project_id, …[, …])             Load data from Google BigQuery.

In [None]:
import pandas as pd
sal = pd.read_csv("Salaries.csv")
sal.head()

## General Functions

## Data Manipulation Functions

In [None]:
melt(frame[, id_vars, value_vars, var_name, …])  Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
pivot(data[, index, columns, values])            Return reshaped DataFrame organized by given index / column values.
pivot_table(data[, values, index, columns, …])   Create a spreadsheet-style pivot table as a DataFrame.
crosstab(index, columns[, values, rownames, …])  Compute a simple cross tabulation of two (or more) factors.

cut(x, bins, right[, labels])                    Bin values into discrete intervals.
qcut(x, q[, labels])                             Quantile-based discretization function.
merge(left, right, how[, on, left_on, …])        Merge DataFrame or named Series objects with a database-style join.
merge_ordered(left, right[, on, left_on, …])     Perform merge with optional filling/interpolation.
merge_asof(left, right[, on, left_on, …])        Perform an asof merge.

concat(objs[, axis, join, …])                    Concatenate pandas objects along a particular axis with optional set logic along the other axes.
get_dummies(data[, prefix, prefix_sep, …])       Convert categorical variable into dummy/indicator variables.
factorize(values, sort, na_sentinel, …)          Encode the object as an enumerated type or categorical variable.
unique(values)                                   Hash table-based unique.
wide_to_long(df, stubnames, i, j, sep, suffix)   Wide panel to long format.
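
A short sketch of how a few of the functions listed above fit together. The frames and column names are invented for illustration; only melt, pivot, cut, merge and concat are shown:

```python
import pandas as pd

wide = pd.DataFrame(
    {"name": ["tom", "nick"], "math": [90, 70], "art": [60, 85]}
)

# melt: wide -> long, one row per (name, subject) pair.
long = pd.melt(wide, id_vars="name", var_name="subject", value_name="score")

# pivot: long -> wide again, one column per subject.
back = long.pivot(index="name", columns="subject", values="score")

# cut: bin the scores into labelled intervals (0, 75] and (75, 100].
long["grade"] = pd.cut(long["score"], bins=[0, 75, 100], labels=["low", "high"])

# merge: database-style join on a shared key column.
ages = pd.DataFrame({"name": ["tom", "nick"], "age": [20, 21]})
joined = pd.merge(long, ages, on="name", how="left")

# concat: stack frames on top of each other and renumber the rows.
stacked = pd.concat([wide, wide], ignore_index=True)

print(joined)
```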

##### Top-level missing data


In [None]:
isna(obj)                                        Detect missing values for an array-like object.
isnull(obj)                                      Detect missing values for an array-like object.
notna(obj)                                       Detect non-missing values for an array-like object.
notnull(obj)                                     Detect non-missing values for an array-like object.
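
For illustration, a small sketch of the top-level missing-data helpers on a made-up Series; isnull and notnull behave the same as isna and notna:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0, None])

missing = pd.isna(values)     # True where the value is NaN or None
present = pd.notna(values)    # the element-wise opposite

print(missing.sum())          # number of missing entries
print(values[present])        # keep only the non-missing entries
```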

##### Top-level dealing with datetimelike
 

In [None]:
to_datetime(arg[, errors, dayfirst, …])         Convert argument to datetime.
to_timedelta(arg[, unit, errors])               Convert argument to timedelta.
date_range([start, end, periods, freq, tz, …])  Return a fixed frequency DatetimeIndex.
bdate_range([start, end, periods, freq, tz, …]) Return a fixed frequency DatetimeIndex, with business day as the default frequency.
period_range([start, end, periods, freq, name]) Return a fixed frequency PeriodIndex.
timedelta_range([start, end, periods, freq, …]) Return a fixed frequency TimedeltaIndex, with day as the default frequency.
infer_freq(index, warn)                         Infer the most likely frequency given the input index.
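
The datetime helpers above can be sketched as follows. The dates are arbitrary examples, and an explicit format is passed so that parsing does not depend on format inference:

```python
import pandas as pd

# Parse strings into timestamps; errors="coerce" turns bad input into NaT.
raw = pd.Series(["2021-01-05", "2021-02-10", "not a date"])
stamps = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Five consecutive calendar days starting at a given date.
days = pd.date_range(start="2021-01-01", periods=5, freq="D")

# Business days only: Saturdays and Sundays are skipped.
workdays = pd.bdate_range(start="2021-01-01", periods=5)

# Convert a number of hours into a Timedelta.
delta = pd.to_timedelta(36, unit="h")

print(stamps)
print(workdays)
print(delta)
```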

## Series and DataFrame Functions

##### Attributes

In [None]:
- Series.index   - The index (axis labels) of the Series.
- Series.array   - The ExtensionArray of the data backing this Series or Index.
- Series.values  - Return Series as ndarray or ndarray-like depending on the dtype.
- Series.dtype   - Return the dtype object of the underlying data.
- Series.shape   - Return a tuple of the shape of the underlying data.
- Series.nbytes  - Return the number of bytes in the underlying data.
- Series.ndim    - Number of dimensions of the underlying data, by definition 1.
- Series.size    - Return the number of elements in the underlying data.
- Series.T       - Return the transpose, which is by definition self.
- Series.hasnans - Return True if there are any NaNs; enables various perf speedups.
- Series.empty   - Indicator whether the Series is empty.

##### Conversion

In [None]:
Series.astype(self, dtype, copy, errors)   - Cast a pandas object to a specified dtype.
Series.copy(self, deep)                    - Make a copy of this object’s indices and data.
Series.bool(self)                          - Return the bool of a single element PandasObject.
Series.to_list(self)                       - Return a list of the values.


##### Indexing, Iteration

In [None]:
.loc                            Access a group of rows and columns by label(s) or a boolean array.
.iloc                           Purely integer-location based indexing for selection by position.

.get(self, key[, default])      Get item from object for given key (ex: DataFrame column).
.at                             Access a single value for a row/column label pair.
.iat                            Access a single value for a row/column pair by integer position.

.__iter__(self)                 Return an iterator of the values.
.items(self)                    Lazily iterate over (index, value) tuples.
.iteritems(self)                Lazily iterate over (index, value) tuples (removed in pandas 2.0; use .items).
.keys(self)                     Return alias for index.
.pop(self, item)                Return item and drop from frame.
.item(self)                     Return the first element of the underlying data as a python scalar.
.xs(self, key[, axis, level])   Return cross-section from the Series/DataFrame.

##### Binary Operator Functions

In [None]:
.add(self, other[, level, fill_value, …])     Return Addition of series 
.sub(self, other[, level, fill_value, …])     Return Subtraction 
.mul(self, other[, level, fill_value, …])     Return Multiplication 
.div(self, other[, level, fill_value, …])     Return Floating division 
.truediv(self, other[, level, …])             Return Floating division 
.floordiv(self, other[, level, …])            Return Integer division 
.mod(self, other[, level, fill_value, …])     Return Modulo
.pow(self, other[, level, fill_value, …])     Return Exponential power 
.ewm(self[, com, span, halflife, …])         Provide exponential weighted functions.

.combine(self, other, func[, fill_value])     Combine according to func.
.round(self[, decimals])                      Round each value in a Series to the given number of decimals.
.lt(self, other[, level, fill_value, axis])   Return Less than  
.gt(self, other[, level, fill_value, axis])   Return Greater than  
.le(self, other[, level, fill_value, axis])   Return Less than or equal to  
.ge(self, other[, level, fill_value, axis])   Return Greater than or equal to
.ne(self, other[, level, fill_value, axis])   Return Not equal to 
.eq(self, other[, level, fill_value, axis])   Return Equal to 
.product(self[, axis, skipna, level, …])      Return the product of the values for the requested axis.
.dot(self, other)                             Compute the dot product 

##### Function Application, Groupby & Window

In [None]:
.apply(self, func[, convert_dtype, args])    Invoke function on values of Series.
.agg(self, func[, axis])                     Aggregate using operations over the specified axis.
.aggregate(self, func[, axis])               Aggregate using operations over the specified axis.
.transform(self, func[, axis])               Call func, produces transformed values.
.map(self, arg[, na_action])                 Map values according to input correspondence.
.groupby(self[, by, axis, level])            Group using a mapper 
.rolling(self, window[, min_periods, …])     Provide rolling window calculations.
.expanding(self[, min_periods, …])           Provide expanding transformations.
.ewm(self[, com, span, halflife, …])         Provide exponential weighted functions.
.pipe(self, func, \*args, \*\*kwargs)        Apply func(self, *args, **kwargs)
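
A minimal sketch of the function-application and grouping methods listed above, on a made-up sales table (the helper add_share is hypothetical and exists only to demonstrate pipe):

```python
import pandas as pd

sales = pd.DataFrame(
    {"team": ["a", "a", "b", "b"], "amount": [10, 20, 30, 50]}
)

# groupby + agg: one row of summary statistics per team.
per_team = sales.groupby("team")["amount"].agg(["sum", "mean"])

# transform: same length as the input, each row receives its group's total.
sales["team_total"] = sales.groupby("team")["amount"].transform("sum")

# map: look each value up in a dict.
sales["team_name"] = sales["team"].map({"a": "Alpha", "b": "Beta"})

# rolling: mean over a sliding window of two rows (the first row has no full window).
sales["rolling_mean"] = sales["amount"].rolling(window=2).mean()

# pipe: pass the whole frame through a function, which keeps method chains readable.
def add_share(frame):
    return frame.assign(share=frame["amount"] / frame["team_total"])

result = sales.pipe(add_share)
print(per_team)
print(result)
```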

##### Computations / descriptive stats


In [None]:

# Data Manipulation
.between(self, left, right[, inclusive])     boolean Series equivalent to left <= series <= right.
.clip(self[, lower, upper, axis])            Trim values at input threshold(s).
.value_counts(self[, normalize, sort, …])    Return a Series containing counts of unique value
.prod(self[, axis, skipna, level, …])        Return the product of the values for the requested axis.
.unique(self)                                unique values of Series object.
.nunique(self[, dropna])                     Return number of unique elements in the object.

# Simple Stats
.describe(self[, percentiles, …])            Generate descriptive statistics.
.abs(self)                                   absolute numeric value of each element.
.count(self[, level])                        number of non-NA/null observations in the Series.
.mean(self[, axis, skipna, level, …])        mean of the values for the requested axis.
.median(self[, axis, skipna, level, …])      median of the values for the requested axis.
.max(self[, axis, skipna, level, …])         maximum of the values for the requested axis.
.min(self[, axis, skipna, level, …])         minimum of the values for the requested axis.
.mode(self[, dropna])                        mode(s) of the dataset.
.std(self[, axis, skipna, level, …])         Return sample standard deviation over requested axis.
.sum(self[, axis, skipna, level, …])         Return the sum of the values for the requested axis.
.quantile(self[, q, interpolation])          Return value at the given quantile.
.rank(self[, axis])                          Compute numerical data ranks (1 through n) along axis.
.pct_change(self[, periods, …])              Percentage change between the current and a prior element.
.corr(self, other[, method, min_periods])    Compute correlation (excluding missing values)
.cov(self, other[, min_periods])             Compute covariance, excluding missing values.
.var(self[, axis, skipna, level, …])         Return unbiased variance over requested axis.

# More advanced Stats
.autocorr(self[, lag])                       Compute the lag-N autocorrelation.
.kurt(self[, axis, skipna, level, …])        unbiased kurtosis over requested axis.
.mad(self[, axis, skipna, level])            mean absolute deviation of the values for the requested axis (removed in pandas 2.0).
.skew(self[, axis, skipna, level, …])        Return unbiased skew over requested axis.
.kurtosis(self[, axis, skipna, level, …])    Return unbiased kurtosis over requested axis.
.sem(self[, axis, skipna, level, …])         Return unbiased standard error of the mean over requested axis.

# Other
.all(self[, axis, bool_only, skipna, …])     Return whether all elements are True, potentially over an axis.
.any(self[, axis, bool_only, skipna, …])     Return whether any element is True, potentially over an axis.

.nlargest(self[, n, keep])                   Return the largest n elements.
.nsmallest(self[, n, keep])                  Return the smallest n elements.

.cummax(self[, axis, skipna])               cumulative maximum over a DataFrame or Series axis.
.cummin(self[, axis, skipna])               cumulative minimum over a DataFrame or Series axis.
.cumprod(self[, axis, skipna])              cumulative product over a DataFrame or Series axis.
.cumsum(self[, axis, skipna])               cumulative sum over a DataFrame or Series axis.

.is_unique                                  Return boolean if values in the object are unique.
.is_monotonic                               Return boolean if values in the object are monotonic_increasing (removed in pandas 2.0; use .is_monotonic_increasing).
.is_monotonic_increasing                    Return boolean if values in the object are monotonic_increasing.
.is_monotonic_decreasing                    Return boolean if values in the object are monotonic_decreasing.
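
To make the listing above concrete, here is a short sketch on an arbitrary integer Series; it touches only a handful of the methods:

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5, 9, 2, 6])

print(s.describe())                  # count, mean, std, min, quartiles, max

counts = s.value_counts()            # how often each value occurs
in_range = s.between(2, 5)           # True where 2 <= value <= 5
clipped = s.clip(lower=2, upper=6)   # values outside [2, 6] are pulled to the bounds
top3 = s.nlargest(3)                 # the three largest values
running = s.cumsum()                 # running total

print(s.nunique(), s.is_unique, s.median())
```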


##### Reindexing / selection / label manipulation


In [None]:
# View
.head(self, n)                              Return the first n rows.
.tail(self, n)                              Return the last n rows.
.sample(self[, n, frac, replace, …])        Return a random sample of items from an axis of object.
.take(self, indices[, axis, is_copy])       Return the elements in the given positional indices along an axis.

# Alter
.rename(self[, index, axis, copy, …])       Alter Series index labels or name.
.rename_axis(self[, mapper, index, …])      Set the name of the axis for the index or columns.
.set_axis(self, labels[, axis, inplace])    Assign desired index to given axis.
.where(self, cond[, other, inplace, …])     Replace values where the condition is False.
.mask(self, cond[, other, inplace, …])      Replace values where the condition is True.
.add_prefix(self, prefix)                   Prefix labels with string prefix.
.add_suffix(self, suffix)                   Suffix labels with string suffix.

# Filter and Find
.filter(self[, items, axis])                Subset, according to the specified index labels.
.truncate(self[, before, after, axis])      Truncate a Series or DataFrame before and after some index value.
.idxmax(self[, axis, skipna])               Return the row label of the maximum value.
.idxmin(self[, axis, skipna])               Return the row label of the minimum value.
.isin(self, values)                         Check whether values are contained in Series.
.equals(self, other)                        Test whether two objects contain the same elements.
.duplicated(self[, keep])                   Indicate duplicate values.
.get(self, key[, default])      Get item from object for given key (ex: DataFrame column).
.at                             Access a single value for a row/column label pair.
.iat                            Access a single value for a row/column pair by integer position.
.loc                            Access a group of rows and columns by label(s) or a boolean array.
.iloc                           Purely integer-location based indexing for selection by position.

# Remove
.drop(self[, labels, axis, index, …])       New dataset with specified index labels removed.
.droplevel(self, level[, axis])             DataFrame with requested index / column level(s) removed.
.drop_duplicates(self[, keep, inplace])     New dataset with duplicate values removed.

##### Missing Data Handling

In [None]:
.isna(self)                                 Detect missing values.
.notna(self)                                Detect existing (non-missing) values.
.dropna(self[, axis, inplace, how])         Return a new Series with missing values removed.
.fillna(self[, value, method, axis, …])     Fill NA/NaN values using the specified method.
.interpolate(self[, method, axis, …])       Interpolate values according to different methods
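
A small sketch of the usual choices for handling gaps, using a made-up Series. Note that ffill() is used for forward filling, since recent pandas releases deprecate fillna(method=...):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

dropped = s.dropna()        # remove the missing entries
filled = s.fillna(0)        # replace them with a constant
forward = s.ffill()         # carry the last valid value forward
interp = s.interpolate()    # linear interpolation between neighbours

print(interp)
```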

##### Reshaping, sorting


In [None]:
.argsort(self[, axis, kind, order])         Override ndarray.argsort.
.sort_values(self[, axis, ascending, …])    Sort by the values.
.sort_index(self[, axis, level, …])         Sort Series by index labels.

.argmin(self[, axis, skipna])               Return a ndarray of the minimum argument indexer.
.argmax(self[, axis, skipna])               Return an ndarray of the maximum argument indexer.

.repeat(self, repeats[, axis])              Repeat elements of a Series.
.reorder_levels(self, order)                Rearrange index levels using input order.
.view(self[, dtype])                        Create a new view of the Series.


##### Combining / joining / merging


In [None]:
.append(self, to_append[, …])              Concatenate two or more Series (removed in pandas 2.0; use pd.concat).
.replace(self[, to_replace, value, …])     Replace values given in to_replace with value.
.update(self, other)                       Modify using non-NA values from passed Series.


##### Time series-related


In [None]:
.asfreq(self, freq[, method, fill_value])  Convert TimeSeries to specified frequency.
.asof(self, where[, subset])               Return the last row(s) without any NaNs before where.
.shift(self[, periods, freq, axis, …])     Shift index by desired number of periods with an optional time freq.
.first_valid_index(self)                   Return index for first non-NA/null value.
.last_valid_index(self)                    Return index for last non-NA/null value.
.resample(self, rule[, axis, loffset, …])  Resample time-series data.
.tz_convert(self, tz[, axis, level])       Convert tz-aware axis to target time zone.
.tz_localize(self, tz[, axis, level, …])   Localize tz-naive index of a Series or DataFrame to target time zone.
.at_time(self, time, asof[, axis])         Select values at particular time of day (e.g., 9:30 AM).
.between_time(self, start_time, …[, …])    Select values between particular times of the day (e.g., 9:00-9:30 AM).
.tshift(self, periods[, freq, axis])       Shift the time index, using the index’s frequency if available (removed in pandas 2.0; use .shift(freq=...)).
.slice_shift(self, periods[, axis])        Equivalent to shift without copying data (removed in pandas 2.0; use .shift).
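
As an illustration of the time-series methods above, a sketch on a made-up daily Series (the dates and the target time zone are arbitrary choices):

```python
import pandas as pd

idx = pd.date_range("2021-01-01", periods=6, freq="D")
ts = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# shift: move values one period later; the first slot becomes NaN.
lagged = ts.shift(1)

# resample: regroup the daily data into 2-day bins and sum each bin.
two_day = ts.resample("2D").sum()

# tz_localize attaches a time zone to naive timestamps; tz_convert changes it.
utc = ts.tz_localize("UTC")
tokyo = utc.tz_convert("Asia/Tokyo")

print(two_day)
print(lagged.first_valid_index())
print(tokyo.index[0])
```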

# Pandas Simple Operations

## Creating a Series and DataFrame

A pandas Series is a one-dimensional labelled array.

The axis labels are called the 'index' (a Series is similar to a column in Excel).

Labels don't need to be unique.

- In the real world, a pandas Series is usually created by loading data from existing storage (CSV, SQL, etc.)
- A Series can also be created from a list, a dictionary, a scalar value, etc.
- Python examples follow below

In [69]:
# The simplest way to create a Series
Ser1 = pd.Series([1,2,3,4], index = ['USA','Germany','USSR','Japan'])
Ser1



USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

In [273]:
# Simplest way to create a DataFrame
# Note: the syntax here is very important,
# and so is the order of positional arguments (the data comes first, then the index)

df1 = pd.DataFrame({'A' : ['A0','A1','A2'],'B' : ['B0','B1','B2']},
                   index= ['K0','K0','K1'],)
df1

Unnamed: 0,A,B
K0,A0,B0
K0,A1,B1
K1,A2,B2


In [12]:
# From an Array
import pandas as pd 
import numpy as np
 
data = np.array(['g','e','e','k','s'])
ser = pd.Series(data)
print(ser)

data2 = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]} 
df = pd.DataFrame(data2) 
print(df)

0    g
1    e
2    e
3    k
4    s
dtype: object
    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18


In [16]:
# From a List
import pandas as pd
 
list1 = ['g', 'e', 'e', 'k', 's']
ser = pd.Series(list1)
print(ser)

print('\n')

list2 = [['tom', 10], ['nick', 15], ['juli', 14]] 
df = pd.DataFrame(list2, columns = ['Name', 'Age']) 
print(df)

0    g
1    e
2    e
3    k
4    s
dtype: object


   Name  Age
0   tom   10
1  nick   15
2  juli   14


In [17]:
# Creating Dataframe from list of dicts

data = [{'a': 1, 'b': 2, 'c':3}, {'a':10, 'b': 20, 'c': 30}] 
df = pd.DataFrame(data) 
df 

Unnamed: 0,a,b,c
0,1,2,3
1,10,20,30


In [18]:
# Creating DataFrame using zip() function.

Name = ['tom', 'krish', 'nick', 'juli']  
Age = [25, 30, 26, 22]  
     
list_of_tuples = list(zip(Name, Age))  
df = pd.DataFrame(list_of_tuples, columns = ['Name', 'Age'])  
df  

Unnamed: 0,Name,Age
0,tom,25
1,krish,30
2,nick,26
3,juli,22


## Exploring the DataFrame and Series

In [None]:
df.info()                                     columns, data types, memory usage, etc.
df.shape                                      (number of rows, number of columns)
df.dtypes                                     column names and their data types
df['Col1'].describe()                         summary statistics for one column
df['Col1'].value_counts()                     count of each distinct value in the column
df['Col1'].value_counts(normalize=True)       proportion (0 to 1) of each value
df['Col1'].value_counts().head()              the five most frequent values
df['Col1'].unique()                           array of unique values
df['Col1'].nunique()                          count of unique values

df.describe()                                 basic stats (count/mean/std/min/quartiles/max)
df.describe().loc['min':'max']                only the rows from min to max
df.describe().loc['min':'max', 'col1':'col2'] rows min to max, columns col1 to col2
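
A runnable sketch of these exploration calls on a small made-up frame (the column names city and temp are placeholders):

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["Paris", "Rome", "Paris", "Oslo"], "temp": [21.0, 25.0, 19.0, 12.0]}
)

df.info()                                        # prints columns, dtypes, memory usage
dtypes = df.dtypes                               # dtype of each column
share = df["city"].value_counts(normalize=True)  # proportion of each city
stats = df.describe().loc["min":"max", "temp"]   # min, quartiles and max of one column

print(df.shape)
print(share)
print(stats)
```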


## Accessing Elements

There are two ways to access an element of a Series:
- by position
- by label (index)

In [4]:
# From position
data = np.array(['g','e','e','k','s','f', 'o','r','g','e','e','k','s'])
ser = pd.Series(data)
print(ser[:3])


0    g
1    e
2    e
dtype: object


In [8]:
# From Label
# To access an element by label, pass its index label inside the brackets.
# A Series is like a fixed-size dictionary in that you can get and set values by index label.

data = np.array(['g','e','e','k','s','f', 'o','r','g','e','e','k','s'])
ser = pd.Series(data,index=[10,11,12,13,14,15,16,17,18,19,20,21,22])

print(ser[16])

o


## Changing Index Values or Label

##### Changing the index values with data

In [103]:
# Method 1: changing with data
data2 = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]} 
data3 = ['rank1', 'rank2', 'rank3', 'rank4'] 
df = pd.DataFrame(data2) 
df.index = data3
df

Unnamed: 0,Name,Age
rank1,Tom,20
rank2,nick,21
rank3,krish,19
rank4,jack,18


In [None]:
# Method 2: pass the index when creating the DataFrame
df = pd.DataFrame(data2, index =['rank1', 'rank2', 'rank3', 'rank4']) 

##### Changing the index Label

In [None]:
# Method 1: set_index
df.set_index('States')

In [None]:
# Method 2: rename
df_new = df.rename(columns={'A': 'a'}, index={'ONE': 'one'})
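
For illustration, the two methods above on a tiny made-up frame (the States column and its values are assumptions, chosen to match the snippet):

```python
import pandas as pd

df = pd.DataFrame({"States": ["Ohio", "Utah"], "A": [1, 2]})

# set_index returns a new frame; the 'States' column becomes the index.
indexed = df.set_index("States")

# rename maps old labels to new ones for columns and/or index.
renamed = indexed.rename(columns={"A": "a"}, index={"Ohio": "OHIO"})

# reset_index moves the index back into an ordinary column.
restored = indexed.reset_index()

print(renamed)
```

Neither call modifies df itself, so the result has to be assigned to a name to be kept.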

## Selecting Data in Series

##### Selection from the data

In [104]:
Ser1 = pd.Series([1,2,3,4], index = ['USA','Germany','USSR','Japan'])

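#both lines print the same value
print(Ser1['USA'])       # by label
print(Ser1.iloc[0])      # by position (plain Ser1[0] is deprecated when the index holds non-integer labels)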

1
1


##### Finding the row that contains a value : str.contains

In [None]:
df2 = df[df['col_name'].str.contains('stuff_we_search')]
df2 = df[df['col1'].str.contains("stuff_1") | df['col2'].str.contains('stuff_2')] 
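
A sketch of the same filtering pattern on made-up data. The na=False argument is an addition not shown in the snippet above; it keeps the mask boolean when a cell is missing:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Avery Bradley", "Jae Crowder", None],
        "team": ["Boston Celtics", "Utah Jazz", "Utah Jazz"],
    }
)

# na=False treats a missing name as "no match".
has_brad = df["name"].str.contains("Brad", na=False)

# Combine two conditions with | (or); case=False ignores upper/lower case.
either = df[has_brad | df["team"].str.contains("utah", case=False)]

print(df[has_brad])
print(either)
```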


##### Finding the row that contains the max of a column

In [None]:
df2 = df[df['col_name'] == df['col_name'].max()]


#####  Pandas standard: iloc and loc

In [None]:
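iloc is for integer positions (0-based)

df.iloc[[3,2,3]]              will return the rows at positions 3, 2, 3 (in that order)
df.iloc[1:3]                  will return the rows at positions 1 and 2 (the end is excluded)
df.iloc[[True,False]]         will return the first row only (the boolean list needs one entry per row)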

In [None]:
loc is for labels  - be careful because next version will not allow loc['a'] anymore

FOR COLUMNS
df.loc[:,'a']                      will return col 'a'
df.loc[:,['b','d']]                will return cols 'b' and 'd'
df.loc[:,'b':'d']                  will return cols 'b' to 'd' (both ends included)

FOR ROWS
df.loc[[True,False]]               will return the first row only (the boolean list needs one entry per row)
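
The difference between the two accessors is easiest to see on a small frame. This is a sketch with made-up labels:

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9], "d": [10, 11, 12]},
    index=["x", "y", "z"],
)

by_pos = df.iloc[1:3]                  # positions 1 and 2; the end is excluded
by_label = df.loc["x":"y"]             # labels 'x' through 'y'; the end is included
cols = df.loc[:, "b":"d"]              # columns 'b' through 'd'
picked = df.loc[:, ["b", "d"]]         # exactly columns 'b' and 'd'
masked = df.loc[[True, False, True]]   # boolean list, one flag per row
cell = df.loc["y", "c"]                # a single value

print(by_pos)
print(cols)
print(cell)
```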

###### Example 1 for loc and iloc (quite long, with changes of col names as well)

In [105]:
# creating dataframe from a csv import
import pandas as pd

df = pd.read_csv("nba.csv")  

# creating Series 
Ser1 = pd.Series(df['Name']) 
data1 = Ser1.head(10)
data1

0    Avery Bradley
1      Jae Crowder
2     John Holland
3      R.J. Hunter
4    Jonas Jerebko
5     Amir Johnson
6    Jordan Mickey
7     Kelly Olynyk
8     Terry Rozier
9     Marcus Smart
Name: Name, dtype: object

In [99]:
# creating Dataframe with 3 columns 
import pandas as pd

df = pd.read_csv("nba.csv") 
Data3 = pd.DataFrame(df, columns = ['Name', 'Team', 'Salary'])

#changing the index to the player names
Data3.index = Data3['Name']
Data3

Unnamed: 0_level_0,Name,Team,Salary
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Avery Bradley,Avery Bradley,Boston Celtics,7730337.0
Jae Crowder,Jae Crowder,Boston Celtics,6796117.0
John Holland,John Holland,Boston Celtics,
R.J. Hunter,R.J. Hunter,Boston Celtics,1148640.0
Jonas Jerebko,Jonas Jerebko,Boston Celtics,5000000.0
...,...,...,...
Trey Lyles,Trey Lyles,Utah Jazz,2239800.0
Shelvin Mack,Shelvin Mack,Utah Jazz,2433333.0
Raul Neto,Raul Neto,Utah Jazz,900000.0
Tibor Pleiss,Tibor Pleiss,Utah Jazz,2900000.0


In [100]:
#Rename Columns labels (that is not really part of loc/iloc but just an exercise)
Data3.rename(columns = {'Name':'FullName', 'Salary':'USDSalary'},inplace=True)
Data3

Unnamed: 0_level_0,FullName,Team,USDSalary
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Avery Bradley,Avery Bradley,Boston Celtics,7730337.0
Jae Crowder,Jae Crowder,Boston Celtics,6796117.0
John Holland,John Holland,Boston Celtics,
R.J. Hunter,R.J. Hunter,Boston Celtics,1148640.0
Jonas Jerebko,Jonas Jerebko,Boston Celtics,5000000.0
...,...,...,...
Trey Lyles,Trey Lyles,Utah Jazz,2239800.0
Shelvin Mack,Shelvin Mack,Utah Jazz,2433333.0
Raul Neto,Raul Neto,Utah Jazz,900000.0
Tibor Pleiss,Tibor Pleiss,Utah Jazz,2900000.0


In [106]:
#For ROWS- using .loc[] function 
Data3.loc[['Avery Bradley','Jae Crowder','Tibor Pleiss']]

Unnamed: 0_level_0,FullName,Team,USDSalary
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Avery Bradley,Avery Bradley,Boston Celtics,7730337.0
Jae Crowder,Jae Crowder,Boston Celtics,6796117.0
Tibor Pleiss,Tibor Pleiss,Utah Jazz,2900000.0


In [109]:
#For COLUMNS- using .loc[] function 
Data3.loc[:,('FullName','Team')]

Unnamed: 0_level_0,FullName,Team
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
Avery Bradley,Avery Bradley,Boston Celtics
Jae Crowder,Jae Crowder,Boston Celtics
John Holland,John Holland,Boston Celtics
R.J. Hunter,R.J. Hunter,Boston Celtics
Jonas Jerebko,Jonas Jerebko,Boston Celtics
...,...,...
Trey Lyles,Trey Lyles,Utah Jazz
Shelvin Mack,Shelvin Mack,Utah Jazz
Raul Neto,Raul Neto,Utah Jazz
Tibor Pleiss,Tibor Pleiss,Utah Jazz


In [108]:
#using .iloc[] function
Data3.iloc[3:6]

Unnamed: 0_level_0,FullName,Team,USDSalary
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
R.J. Hunter,R.J. Hunter,Boston Celtics,1148640.0
Jonas Jerebko,Jonas Jerebko,Boston Celtics,5000000.0
Amir Johnson,Amir Johnson,Boston Celtics,12000000.0


In [117]:
#For Columns - using .iloc[] function 
Data3.iloc[:,[1,2]]

Unnamed: 0_level_0,Team,USDSalary
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
Avery Bradley,Boston Celtics,7730337.0
Jae Crowder,Boston Celtics,6796117.0
John Holland,Boston Celtics,
R.J. Hunter,Boston Celtics,1148640.0
Jonas Jerebko,Boston Celtics,5000000.0
...,...,...
Trey Lyles,Utah Jazz,2239800.0
Shelvin Mack,Utah Jazz,2433333.0
Raul Neto,Utah Jazz,900000.0
Tibor Pleiss,Utah Jazz,2900000.0


##### Fast Scalar Access: at, iat

Pandas recommends the .at and .iat accessors for fast access to a single scalar value:
- With .at[]: row and column labels
- With .iat[]: row and column positions (integers)

In [None]:
df.at['a','A']                  will return element at row 'a' col 'A'
df.iat[0,0]                     will return element at row position 0, col position 0
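
A minimal sketch of both accessors on a made-up two-by-two frame; note the square brackets, since .at and .iat are indexers rather than methods:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["a", "b"])

by_label = df.at["a", "A"]     # row label 'a', column label 'A'
by_position = df.iat[1, 1]     # second row, second column

# Both accessors can also assign a single value.
df.at["b", "A"] = 20

print(by_label, by_position)
print(df)
```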

##### Different Ways To Select Columns

In [196]:
a = pd.DataFrame(np.random.randn(5,4),index=['A','B','C','D','E'],columns=['W','X','Y','Z'])
a

Unnamed: 0,W,X,Y,Z
A,0.147027,-0.479448,0.558769,1.02481
B,-0.925874,1.862864,-1.133817,0.610478
C,0.38603,2.084019,-0.376519,0.230336
D,0.681209,1.035125,-0.03116,1.939932
E,-1.005187,-0.74179,0.187125,-0.732845


In [187]:
# just use the column name
a['Y']

A   -0.943406
B    0.238127
C   -1.136645
D   -0.031579
E   -0.755325
Name: Y, dtype: float64

In [188]:
# for several columns
a[['Y','W']]

Unnamed: 0,Y,W
A,-0.943406,-0.497104
B,0.238127,-0.116773
C,-1.136645,-0.993263
D,-0.031579,1.025984
E,-0.755325,2.154846


In [197]:
# This works as well (dot notation, as in SQL; not recommended because it breaks when a column name clashes with a DataFrame attribute)
a.W  

A    0.147027
B   -0.925874
C    0.386030
D    0.681209
E   -1.005187
Name: W, dtype: float64


##### Take a Few Rows: take

Both Series and DataFrame support a method take() which accepts a list of indices and returns rows at those indices.

Possible to take columns by using axis = 1

In [None]:
df.take([0, 3, 4])              will return rows 0, 3, 4
df.take([0, 3], axis = 1)       will return cols 0, 3
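
For illustration, take on a made-up frame with letter labels, which shows that the indices are positions, not labels:

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [10, 20, 30, 40, 50], "b": [1, 2, 3, 4, 5], "c": list("vwxyz")},
    index=list("ABCDE"),
)

rows = df.take([0, 3, 4])        # rows at positions 0, 3 and 4
cols = df.take([0, 2], axis=1)   # columns at positions 0 and 2

print(rows)
print(cols)
```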

##### Slicing across rows and columns

The standard Python slice syntax also works on rows, although the pandas documentation recommends the optimized access methods loc and iloc.

In [None]:
df[:4]     for the first 4 rows
df[-4:]    for the last 4 rows
df[2:4]    for a range of rows (positions 2 and 3)

In [218]:
a = pd.DataFrame([[1.0,5.0,1],[2,np.nan,2],[np.nan,np.nan,3]], columns = ['A','B','C'])
b = a[['A','C']]
b.iloc[[0,1]]

Unnamed: 0,A,C
0,1.0,1
1,2.0,2


In [219]:
a = pd.DataFrame(np.random.randn(5,4),index=['A','B','C','D','E'],columns=['W','X','Y','Z'])
a[:2][['W','Y']]    #select the rows, then select the columns

Unnamed: 0,W,Y
A,0.302665,-1.706086
B,-0.134841,0.166905


## Adding / Removing / Renaming Columns

##### Adding a new column (at end) with a list, or array

In [132]:
c = ['a','b','c','d','e']
a['new'] = c
a

Unnamed: 0,W,X,Y,Z,new
A,-0.765488,1.106956,-1.322725,-0.03244,a
B,-2.075747,0.098703,-0.795067,1.109205,b
C,-0.55975,0.849862,-1.300241,0.427709,c
D,1.438867,-1.171476,-1.589255,0.025656,d
E,-0.344111,-0.371495,-0.055213,-0.852237,e


##### Inserting a new column

In [133]:
d = ['a1','b1','c1','d1','e1']
a.insert(2,"newest",d, True)
a

Unnamed: 0,W,X,newest,Y,Z,new
A,-0.765488,1.106956,a1,-1.322725,-0.03244,a
B,-2.075747,0.098703,b1,-0.795067,1.109205,b
C,-0.55975,0.849862,c1,-1.300241,0.427709,c
D,1.438867,-1.171476,d1,-1.589255,0.025656,d
E,-0.344111,-0.371495,e1,-0.055213,-0.852237,e


##### Adding a new column using .assign

In [134]:
a2 = a.assign(A = [45,56,78,89,12])
a2

Unnamed: 0,W,X,newest,Y,Z,new,A
A,-0.765488,1.106956,a1,-1.322725,-0.03244,a,45
B,-2.075747,0.098703,b1,-0.795067,1.109205,b,56
C,-0.55975,0.849862,c1,-1.300241,0.427709,c,78
D,1.438867,-1.171476,d1,-1.589255,0.025656,d,89
E,-0.344111,-0.371495,e1,-0.055213,-0.852237,e,12


##### Removing a column

In [None]:
del df['colname']
df.drop('colname', axis=1)                    # returns a new DataFrame without the col; df itself is unchanged
df.drop('colname', axis=1, inplace=True)      # inplace=True removes the col from df itself

##### Renaming a column

In [None]:
# method 1 : rename
df.rename(columns = {'old column name' : 'new column name', 'oldcol': 'newcol'}, inplace=True)

In [None]:
# method 2 - just give the new column names
def_col_name = ['new column name', 'newcol']
df.columns = def_col_name

In [None]:
# just to modify some strings (here, replacing space by _)
df.columns = df.columns.str.replace(' ','_')

##### Reversing Order

In [None]:
df.loc[::-1]                           reverse row order
df.loc[::-1].reset_index(drop=True)    reverse row order and renumber the index from 0
df.loc[:,::-1]                         reverse columns order


## Binary Operations 

We can perform binary operations on Series, such as addition, subtraction, and many others.

Besides the usual operators (+, -, *, /), pandas provides methods such as .add() and .sub(), which additionally accept a fill_value for labels that are missing on one side.

In [None]:
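to note:

the axis argument decides the direction of sum()
- df.sum() (axis=0, the default) adds up the rows of each column: one total per column
- df.sum(axis=1) adds up the columns of each row: one total per row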

#### COMMENT ####

List of Operations:

- add()   add a Series or list-like object of the same length to the caller Series
- sub()   subtract a Series or list-like object of the same length from the caller Series
- mul()   multiply the caller Series by a Series or list-like object of the same length
- div()   divide the caller Series by a Series or list-like object of the same length
- sum()   sum of the values along the requested axis
- prod()  product of the values along the requested axis
- mean()  mean of the values along the requested axis
- pow()   raise the caller Series to the given power, element by element
- abs()   absolute numeric value of each element in a Series/DataFrame
- cov()   covariance of two Series
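
A minimal sketch of the reductions in this list, showing how the axis changes what `sum()` returns. The frame `df_demo` is made up for illustration.

```python
import pandas as pd

# Hypothetical frame, used only for illustration
df_demo = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

col_totals = df_demo.sum()         # default axis=0: one total per column
row_totals = df_demo.sum(axis=1)   # axis=1: one total per row
col_means = df_demo.mean()         # mean of each column
col_prods = df_demo.prod()         # product of each column

print(col_totals.to_dict())   # {'x': 6, 'y': 60}
print(row_totals.tolist())    # [11, 22, 33]
```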

Example of syntax

add(self, other, axis='columns', level=None, fill_value=None)

Parameters
   - other: scalar, sequence, Series, or DataFrame
   - axis: {0 or ‘index’, 1 or ‘columns’}
   - level: int or label - Broadcast across a level, matching Index values on the passed MultiIndex level.
   - fill_value: float or None, default None
     - Fill existing missing (NaN) values with this value before computation. 
     - If data in both corresponding DataFrame locations is missing the result will be missing.


In [1]:
import pandas as pd  
data = pd.Series([5, 2, 3,7], index=['a', 'b', 'c', 'd'])
data1 = pd.Series([1, 6, 4, 9], index=['a', 'b', 'd', 'e'])

print(data, "\n\n", data1)

a    5
b    2
c    3
d    7
dtype: int64 

 a    1
b    6
d    4
e    9
dtype: int64


In [20]:
data.add(data1, fill_value=0)

a     6.0
b     8.0
c     3.0
d    11.0
e     9.0
dtype: float64

In [22]:
data.add(data1)

a     6.0
b     8.0
c     NaN
d    11.0
e     NaN
dtype: float64
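
The other arithmetic methods follow the same alignment rules. The sketch below reuses the two Series from above. The choice of `fill_value=1` for multiplication is only an illustration: it leaves unmatched values unchanged.

```python
import pandas as pd

data = pd.Series([5, 2, 3, 7], index=['a', 'b', 'c', 'd'])
data1 = pd.Series([1, 6, 4, 9], index=['a', 'b', 'd', 'e'])

# Labels missing on one side are replaced by fill_value before the operation
diff = data.sub(data1, fill_value=0)
product = data.mul(data1, fill_value=1)

print(diff.tolist())     # [4.0, -4.0, 3.0, 3.0, -9.0]
print(product.tolist())  # [5.0, 12.0, 3.0, 28.0, 9.0]
```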

In [216]:
# Note: .loc takes (row, column); use ':' as the row part to select a whole column
import numpy as np

a = pd.DataFrame([[1.0,5.0,1],[2,np.nan,2],[np.nan,np.nan,3]], columns = ['W','Y','Z'])
a['new'] = a.loc[:, 'W'] + a.loc[:, 'Y']   # NaN in either operand gives NaN
a

Unnamed: 0,W,Y,Z,new
0,1.0,5.0,1,6.0
1,2.0,,2,
2,,,3,
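
If the missing values should count as zero instead, `.add()` with `fill_value` can replace the `+` operator. A sketch on the same data; note that a row where both values are missing still gives NaN.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([[1.0, 5.0, 1], [2, np.nan, 2], [np.nan, np.nan, 3]],
                 columns=['W', 'Y', 'Z'])

# fill_value=0 substitutes 0 when only one of the two operands is missing
filled = a['W'].add(a['Y'], fill_value=0)

print(filled.tolist())  # [6.0, 2.0, nan]
```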


## Custom Functions

##### Using Lambda and apply

In [None]:
dfsq = df['col1'].apply(lambda x: x * x)       # square each value in col1
dflen = df['col3'].apply(lambda x: len(x))     # length of each value in col3
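
A runnable sketch of the same two calls on a made-up frame, next to their vectorised equivalents. The column contents are hypothetical.

```python
import pandas as pd

# Hypothetical frame, used only for illustration
df_demo = pd.DataFrame({"col1": [1, 2, 3], "col3": ["a", "bb", "ccc"]})

# apply calls the function once per element
squared = df_demo["col1"].apply(lambda x: x * x)
lengths = df_demo["col3"].apply(len)

# Vectorised forms give the same values without a Python-level loop
squared_vec = df_demo["col1"] ** 2
lengths_vec = df_demo["col3"].str.len()

print(squared.tolist())  # [1, 4, 9]
print(lengths.tolist())  # [1, 2, 3]
```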

## Conversion Operations

Conversion operations change the data type of a Series or DataFrame.
The main functions for this are .astype(), .tolist() and pd.to_numeric().



### To Numeric

##### Syntax to_numeric

In [None]:
pandas.to_numeric(arg, errors='raise', downcast=None)

- arg       scalar, list, tuple, 1-d array, or Series
- errors    {‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’
    - If ‘raise’, then invalid parsing will raise an exception.
    - If ‘coerce’, then invalid parsing will be set as NaN.
    - If ‘ignore’, then invalid parsing will return the input.
- downcast  {‘integer’, ‘signed’, ‘unsigned’, ‘float’}, default None
    - If not None, downcast the result to the smallest numerical dtype possible according to the following rules:
    - ‘integer’ or ‘signed’: smallest signed int dtype (min.: np.int8)
    - ‘unsigned’: smallest unsigned int dtype (min.: np.uint8)
    - ‘float’: smallest float dtype (min.: np.float32)
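
The effect of `downcast` can be seen by comparing the resulting dtypes. This is a small sketch; the Series `ints` and `floats` are made up and built with `dtype=object` so the input type is explicit.

```python
import pandas as pd

# Hypothetical inputs, used only for illustration
ints = pd.Series(["1", "2", "3"], dtype=object)
floats = pd.Series(["1.5", "2", "3"], dtype=object)

default = pd.to_numeric(ints)                        # int64
small = pd.to_numeric(ints, downcast="integer")      # smallest signed int that fits
unsigned = pd.to_numeric(ints, downcast="unsigned")  # smallest unsigned int that fits
small_float = pd.to_numeric(floats, downcast="float")

print(default.dtype, small.dtype, unsigned.dtype, small_float.dtype)
# int64 int8 uint8 float32
```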


##### Examples

In [32]:
# Example
df = pd.DataFrame({'A': ['100','200','python','300.12','400'],
                   'B': ['50',2000,'banana',100.52,'600']   })
df


Unnamed: 0,A,B
0,100,50
1,200,2000
2,python,banana
3,300.12,100.52
4,400,600


In [33]:
# Here, 'coerce' will make sure that any error will result in NaN
a = pd.to_numeric(df['A'], errors='coerce')
a

0    100.00
1    200.00
2       NaN
3    300.12
4    400.00
Name: A, dtype: float64

In [36]:
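# fillna is a method: call it with (0) to replace the NaN values with 0
a = pd.to_numeric(df['A'], errors='coerce').fillna(0)
a

0    100.00
1    200.00
2      0.00
3    300.12
4    400.00
Name: A, dtype: float64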

In [37]:
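# apply to_numeric to every column, then replace the NaN values with 0
df = df.apply(pd.to_numeric, errors='coerce').fillna(0)
df

Unnamed: 0,A,B
0,100.0,50.0
1,200.0,2000.0
2,0.0,0.0
3,300.12,100.52
4,400.0,600.0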

### Astype

##### Syntax astype

In [None]:
DataFrame.astype(self, dtype, copy: bool = True, errors: str = 'raise') 

- dtype: data type, or dict of column name -> data type
   - float, int, str, etc
   - Or {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type
- copy: bool, default True
   - Return a copy when copy=True
- errors: {‘raise’, ‘ignore’}, default ‘raise’
   - Control raising of exceptions on invalid data for provided dtype.
   - raise : allow exceptions to be raised
   - ignore : suppress exceptions. On error return original object.

##### Example

In [39]:
df = pd.DataFrame({ 'A': [1, 2, 3, 4, 5], 
                   'B': ['a', 'b', 'c', 'd', 'e'], 
                   'C': [1.1, '1.0', '1.3', 2, 5] }) 
  
# using dictionary to convert specific columns 
convert_dict = {'A': int, 'B': str, 'C': float} 
  
df = df.astype(convert_dict) 
print(df.dtypes) 

A      int32
B     object
C    float64
dtype: object
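
With the default `errors='raise'`, `astype` fails as soon as one value cannot be converted. The sketch below shows that case on a made-up Series, with `pd.to_numeric(..., errors='coerce')` as one possible fallback.

```python
import pandas as pd

# Hypothetical input containing one value that is not a number
raw = pd.Series(["1", "2", "x"], dtype=object)

astype_failed = False
try:
    raw.astype(int)
except ValueError as exc:
    astype_failed = True
    print("astype failed:", exc)

# to_numeric with errors='coerce' turns the bad value into NaN instead
cleaned = pd.to_numeric(raw, errors="coerce")
print(cleaned.tolist())  # [1.0, 2.0, nan]
```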


## Sorting Values

In [None]:
df.sort_values(by=['col1'])                                           # basic
df.sort_values(by=['col1', 'col2'])                                   # multiple columns
df.sort_values(by='col1', ascending=False)                            # descending order
df.sort_values(by='col1', ascending=False, na_position='first')       # puts NAs first
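
A small sketch of these options on a made-up frame with a tie and a missing value. `df_demo` and its contents are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, used only for illustration
df_demo = pd.DataFrame({"col1": [2.0, np.nan, 1.0, 2.0],
                        "col2": ["b", "c", "d", "a"]})

# Ties in col1 are broken by col2; NaN goes last by default
by_two = df_demo.sort_values(by=["col1", "col2"])
print(by_two.index.tolist())  # [2, 3, 0, 1]

# Descending order, with the NaN row moved to the top
desc_na_first = df_demo.sort_values(by="col1", ascending=False, na_position="first")
print(desc_na_first.index[0])  # 1
```

`sort_values` returns a new DataFrame, so the result has to be assigned to keep the sorted order.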