<img src="../../Pics/MLSb-T.png" width="160">
<br><br>
<center><u><H1>Pandas Data Structures and Operations</H1></u></center>

### The pandas library has the following main data structures:

1. Series

2. DataFrames

<u><H2>SERIES:</H2></u>

In [1]:
import pandas as pd
import numpy as np

In [2]:
pd.Series(np.random.randn(6))

0   -0.899000
1   -0.439944
2    1.140025
3   -0.783033
4    1.689482
5   -0.634700
dtype: float64

### The index of a Series can be customized with labels (letters or numbers):

In [3]:
pd.Series(np.random.randn(4), index=[1,2,3,4])

1   -0.814353
2   -0.210134
3   -0.564870
4    1.908261
dtype: float64
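As a small illustration of letter labels, the sketch below builds a Series with a string index and reads values back by label and by position. The variable names and values are made up for the example.

```python
import pandas as pd

# Letter labels instead of the default 0..n-1 integer index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = s['b']         # label-based lookup
by_position = s.iloc[1]   # position-based lookup, same element here
subset = s[['a', 'c']]    # a list of labels returns a smaller Series
```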

### We can also build a Series from a Python dict (the keys become the index):

In [4]:
d = {'A': 6, 'B': 3, 'C': 'abc'}
pd.Series(d)

A      6
B      3
C    abc
dtype: object

<u><H2>DataFrame:</H2></u>

<b>A DataFrame is a 2D data structure whose columns can hold different data types and whose rows are labeled by an index. It can be built from the following data structures:</b>
1. NumPy array
2. Lists
3. Dicts
4. Series
5. 2D NumPy array
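The cells that follow use dicts of Series and dicts of lists. As a complementary sketch, a DataFrame can also be built from a 2D NumPy array or from a list of dicts. The names `arr`, `records`, `df_arr` and `df_rec` are illustrative only.

```python
import numpy as np
import pandas as pd

# From a 2D NumPy array: column names are supplied separately
arr = np.arange(6).reshape(3, 2)   # rows: [0, 1], [2, 3], [4, 5]
df_arr = pd.DataFrame(arr, columns=['x', 'y'])

# From a list of dicts: each dict is one row, keys become column names
records = [{'x': 1, 'y': 'a'}, {'x': 2, 'y': 'b'}]
df_rec = pd.DataFrame(records)
```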

In [5]:
#using dict of series
d = {'column_1': pd.Series([1,2,3]),
    'column_2': pd.Series(['aw',8,'ty'])}
df = pd.DataFrame(d)
df

Unnamed: 0,column_1,column_2
0,1,aw
1,2,8
2,3,ty


In [6]:
#using dict of lists
d = {'column_1': [1,2,3],
    'column_2': ['aw',8,'ty'],
    'column_3': [4,5,6]}
df = pd.DataFrame(d)
df

Unnamed: 0,column_1,column_2,column_3
0,1,aw,4
1,2,8,5
2,3,ty,6


### Selection and Indexing:

In [7]:
# selecting a column
df['column_1']

0    1
1    2
2    3
Name: column_1, dtype: int64

In [8]:
# selecting more than one column
df[['column_1','column_2']]

Unnamed: 0,column_1,column_2
0,1,aw
1,2,8
2,3,ty


### <u>loc and iloc:</u>
#### - loc selects by the labels in the index.
#### - iloc selects by integer position in the index (so it only takes integers).

In [9]:
#selecting rows
df.loc[1]

column_1    2
column_2    8
column_3    5
Name: 1, dtype: object

In [10]:
df.iloc[0]

column_1     1
column_2    aw
column_3     4
Name: 0, dtype: object
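With the default 0..n-1 index, labels and positions coincide, so `loc` and `iloc` look interchangeable. A minimal sketch with a hypothetical frame whose labels differ from its positions makes the distinction visible:

```python
import pandas as pd

# Hypothetical frame whose row labels differ from the row positions
df_demo = pd.DataFrame({'val': ['a', 'b', 'c']}, index=[10, 20, 30])

by_label = df_demo.loc[20, 'val']    # row labeled 20
by_position = df_demo.iloc[0, 0]     # first row, first column

label_slice = df_demo.loc[10:20]     # label slices include both endpoints
position_slice = df_demo.iloc[0:1]   # position slices exclude the stop
```

Here `df_demo.loc[0]` would raise a `KeyError`, because no row carries the label 0.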

In [11]:
type(df['column_1'])

pandas.core.series.Series

### Inserting new column

In [12]:
df['new'] = df['column_1'] + df['column_3']
df

Unnamed: 0,column_1,column_2,column_3,new
0,1,aw,4,5
1,2,8,5,7
2,3,ty,6,9


### Deleting a column

In [13]:
df.drop('new',axis=1,inplace=True) # axis=1 drops a column; inplace=True modifies df itself instead of returning a new DataFrame
df

Unnamed: 0,column_1,column_2,column_3
0,1,aw,4
1,2,8,5
2,3,ty,6
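Without `inplace=True`, `drop` leaves the original untouched and returns a new DataFrame. The sketch below shows this on a small made-up frame, using the `columns=` keyword as an alternative to `axis=1`.

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# drop() returns a new DataFrame; df_demo keeps both columns
trimmed = df_demo.drop(columns='b')
```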


### Selecting a subset of the dataframe with rows and columns

In [14]:
df.loc[1,'column_1']

2

In [15]:
df.loc[[0,2],['column_2','column_3']]

Unnamed: 0,column_2,column_3
0,aw,4
2,ty,6


### Selection by condition:

In [16]:
df[df['column_1']>1][['column_2','column_3']]

Unnamed: 0,column_2,column_3
1,8,5
2,ty,6


In [17]:
df[(df['column_1']>1) & (df['column_3'] > 5)]

Unnamed: 0,column_1,column_2,column_3
2,3,ty,6
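Conditions can also be combined with `|` (or), negated with `~`, or written with `isin`. The sketch below uses a made-up frame; each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons.

```python
import pandas as pd

df_demo = pd.DataFrame({'n': [1, 2, 3, 4], 'tag': ['x', 'y', 'x', 'z']})

either = df_demo[(df_demo['n'] < 2) | (df_demo['tag'] == 'z')]  # OR
not_x = df_demo[~(df_demo['tag'] == 'x')]                       # NOT
in_set = df_demo[df_demo['tag'].isin(['y', 'z'])]               # membership

# loc accepts a boolean mask for rows and a label for the column
picked = df_demo.loc[df_demo['n'] > 2, 'tag']
```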


### Index properties:

In [18]:
#Array of index values
df.index.values

array([0, 1, 2], dtype=int64)

In [19]:
#Using the split function of strings to have a list of items
a = '1 b 45'.split()

In [20]:
#Inserting a new column from a list of values (the list length must match the number of rows)
df['column_4'] = a
df

Unnamed: 0,column_1,column_2,column_3,column_4
0,1,aw,4,1
1,2,8,5,b
2,3,ty,6,45


In [21]:
#Inserting new row
df.loc[3]=[6,432,'zxy','cv']
df

Unnamed: 0,column_1,column_2,column_3,column_4
0,1,aw,4,1
1,2,8,5,b
2,3,ty,6,45
3,6,432,zxy,cv
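Assigning to `df.loc[new_label]` is convenient for a single row. When several rows are added, one common alternative is `pd.concat`, sketched here on a made-up frame:

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# Wrap the new row in its own one-row DataFrame, then concatenate
new_row = pd.DataFrame([{'a': 3, 'b': 'z'}])
combined = pd.concat([df_demo, new_row], ignore_index=True)  # renumber rows 0..n-1
```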


<u><H2>OPERATIONS:</H2></u>

In [22]:
df = pd.DataFrame({'col1':[11,42,33,14,25],'col2':[444,555,666,454,452],'col3':['abc','def','gci','xcz','wec']})
df.head(3)

Unnamed: 0,col1,col2,col3
0,11,444,abc
1,42,555,def
2,33,666,gci


In [23]:
df['col2'].sum()

2571

In [24]:
(df['col2']==444).sum()

1

In [25]:
df['col2'].count()

5

In [26]:
df['col2'].value_counts()

454    1
452    1
444    1
555    1
666    1
Name: col2, dtype: int64
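In the cell above every value occurs once, so the counts are all 1. A short sketch on a made-up Series with repeated values shows the counts sorted from most to least frequent, and relative frequencies via `normalize=True`.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c', 'a'])

counts = s.value_counts()                 # most frequent value first
shares = s.value_counts(normalize=True)   # fractions instead of counts
```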

In [27]:
df['col2'].values

array([444, 555, 666, 454, 452], dtype=int64)

In [28]:
a = df.columns.values
b = sorted(a)
b

['col1', 'col2', 'col3']

In [29]:
df.sort_values(by='col2')

Unnamed: 0,col1,col2,col3
0,11,444,abc
4,25,452,wec
3,14,454,xcz
1,42,555,def
2,33,666,gci
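`sort_values` also accepts a sort direction and several sort keys. A minimal sketch on a made-up frame:

```python
import pandas as pd

df_demo = pd.DataFrame({'g': ['b', 'a', 'b', 'a'], 'v': [1, 4, 3, 2]})

# Descending order on one column
desc = df_demo.sort_values(by='v', ascending=False)

# Sort by 'g' ascending, then by 'v' descending within each group
multi = df_demo.sort_values(by=['g', 'v'], ascending=[True, False])
```

Both calls return new DataFrames, so `df_demo` keeps its original row order.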


In [30]:
df.dropna()

Unnamed: 0,col1,col2,col3
0,11,444,abc
1,42,555,def
2,33,666,gci
3,14,454,xcz
4,25,452,wec


In [31]:
df.isnull()

Unnamed: 0,col1,col2,col3
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False


In [32]:
df.loc[6]=[np.nan,52,np.nan]
df.fillna('value',inplace=True)
df

Unnamed: 0,col1,col2,col3
0,11,444.0,abc
1,42,555.0,def
2,33,666.0,gci
3,14,454.0,xcz
4,25,452.0,wec
6,value,52.0,value
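Filling every gap with the same string, as above, turns the numeric column `col1` into a mixed object column. One alternative is to pass a dict to `fillna` so that each column gets a replacement of its own type. The column names and fill values below are made up.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'num': [1.0, np.nan, 3.0],
                        'txt': ['a', np.nan, 'c']})

# One replacement per column: numbers stay numeric, text stays text
filled = df_demo.fillna({'num': 0, 'txt': 'missing'})
```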


In [33]:
#Reset index
df.reset_index(inplace=True)
df

Unnamed: 0,index,col1,col2,col3
0,0,11,444.0,abc
1,1,42,555.0,def
2,2,33,666.0,gci
3,3,14,454.0,xcz
4,4,25,452.0,wec
5,6,value,52.0,value
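`reset_index` keeps the old index as a new column named `index`, as the output above shows. If the old labels are not needed, `drop=True` discards them. A sketch on a made-up frame:

```python
import pandas as pd

df_demo = pd.DataFrame({'v': [1, 2, 3]}, index=[0, 1, 6])

kept = df_demo.reset_index()              # old labels saved in an 'index' column
dropped = df_demo.reset_index(drop=True)  # old labels discarded
```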


In [34]:
#Sort by index
df.sort_index(ascending=False,inplace=True)
df

Unnamed: 0,index,col1,col2,col3
5,6,value,52.0,value
4,4,25,452.0,wec
3,3,14,454.0,xcz
2,2,33,666.0,gci
1,1,42,555.0,def
0,0,11,444.0,abc


In [35]:
#Setting index
df.set_index('col3',inplace=True)
df

col3,index,col1,col2
value,6,value,52.0
wec,4,25,452.0
xcz,3,14,454.0
gci,2,33,666.0
def,1,42,555.0
abc,0,11,444.0


In [36]:
df.reset_index(inplace=True)
df

Unnamed: 0,col3,index,col1,col2
0,value,6,value,52.0
1,wec,4,25,452.0
2,xcz,3,14,454.0
3,gci,2,33,666.0
4,def,1,42,555.0
5,abc,0,11,444.0


### STRING OPERATIONS:

In [37]:
df['col1'].str.extract(r'(\w+)',expand=False)  # raw string avoids an invalid-escape warning for \w

0    value
1      NaN
2      NaN
3      NaN
4      NaN
5      NaN
Name: col1, dtype: object
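The `NaN` rows above appear because `col1` mixes strings with numbers, and the `.str` methods return `NaN` for non-string entries. If the numbers should be treated as text, one option is to convert the column with `astype(str)` first. A sketch on a made-up Series:

```python
import pandas as pd

s = pd.Series(['value', 25, 14], dtype=object)

# Non-string entries yield NaN
raw = s.str.extract(r'(\w+)', expand=False)

# Converting to str first lets the pattern match the digits too
as_text = s.astype(str).str.extract(r'(\w+)', expand=False)
```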

### Filtering:

In [38]:
df[df['col3']=='abc']

Unnamed: 0,col3,index,col1,col2
5,abc,0,11,444.0


### Uppercase/Lowercase:

In [39]:
df['col3'].str.upper()  # use .str.lower() for lowercase

0    VALUE
1      WEC
2      XCZ
3      GCI
4      DEF
5      ABC
Name: col3, dtype: object

### Length of the string:

In [40]:
df['col3'].str.len()

0    5
1    3
2    3
3    3
4    3
5    3
Name: col3, dtype: int64

### Split:

In [41]:
df['col3'].str.split('c')

0    [value]
1     [we, ]
2     [x, z]
3     [g, i]
4      [def]
5     [ab, ]
Name: col3, dtype: object

### Replace:

In [42]:
df['col3'].str.replace(' ','c')

0    value
1      wec
2      xcz
3      gci
4      def
5      abc
Name: col3, dtype: object
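In the cell above no value contains a space, so the output equals the input. The sketch below, on a made-up Series, shows a replacement that does change the data, with the `regex` argument set explicitly to distinguish a literal match from a pattern match.

```python
import pandas as pd

s = pd.Series(['a-b', 'c.d', 'e f'])

# regex=False: '.' is a literal dot, not "any character"
literal = s.str.replace('.', '_', regex=False)

# regex=True: the character class matches '-', '.' or a space
pattern = s.str.replace(r'[-. ]', '_', regex=True)
```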

### Contains:

In [43]:
c=df['col3'].str.contains('c')
c

0    False
1     True
2     True
3     True
4    False
5     True
Name: col3, dtype: bool
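The boolean Series returned by `str.contains` can be used directly as a row filter. The sketch below, on a made-up Series, also shows `case=False` for case-insensitive matching and `na=False` so that missing values count as non-matches.

```python
import pandas as pd

s = pd.Series(['abc', 'def', 'Cab', None])

mask = s.str.contains('c', case=False, na=False)
matches = s[mask]   # keep only the entries where the mask is True
```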

### ONE-HOT ENCODING:

In [44]:
dfhot = pd.DataFrame({'gender':['male','female','male','female','male'],'age_range':['young','adult','senior','young','adult']})
dfhot

Unnamed: 0,age_range,gender
0,young,male
1,adult,female
2,senior,male
3,young,female
4,adult,male


In [45]:
data_dummies = pd.get_dummies(dfhot)
data_dummies

Unnamed: 0,age_range_adult,age_range_senior,age_range_young,gender_female,gender_male
0,0,0,1,0,1
1,1,0,0,1,0
2,0,1,0,0,1
3,0,0,1,1,0
4,1,0,0,0,1
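The output above shows 0/1 integers. Recent pandas releases (2.0 and later) return boolean columns by default, so passing `dtype=int` is one way to get the integer form regardless of version. The sketch also shows `drop_first=True`, which drops one level per feature to avoid redundant columns. The frame is made up for the example.

```python
import pandas as pd

df_demo = pd.DataFrame({'gender': ['male', 'female', 'male']})

# Explicit integer dummies, one column per category
dummies = pd.get_dummies(df_demo, dtype=int)

# drop_first=True omits the first category ('female' here)
reduced = pd.get_dummies(df_demo, dtype=int, drop_first=True)
```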


## Reference:

https://pandas.pydata.org/pandas-docs/stable/dsintro.html

https://pandas.pydata.org/pandas-docs/stable/basics.html