# Pandas

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with both relational and labeled data easy and intuitive. It is a fundamental high-level building block for doing practical, real-world data analysis in Python.

In [3]:
import pandas as pd
import numpy as np


## Series
A Series is a single vector of data (like a NumPy array) with an index that labels each element in the vector.

In [4]:
counts = pd.Series([10,20,30,40,50])
counts

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [5]:
#If an index is not specified, a default sequence of integers is assigned as the index.
#A NumPy array comprises the values of the Series, while the index is a pandas Index object.
counts.values

array([10, 20, 30, 40, 50], dtype=int64)
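
The two parts of a Series described above can be inspected separately. The following is a minimal sketch that rebuilds `counts` so it runs on its own; the variable names `default_index` and `values` are chosen here for illustration only.

```python
import pandas as pd

counts = pd.Series([10, 20, 30, 40, 50])

# With no explicit index, pandas assigns a default integer range.
default_index = counts.index
print(default_index)

# to_numpy() returns the values of the Series as a NumPy array.
values = counts.to_numpy()
print(values)
```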

In [6]:
#We can assign meaningful labels to the index
count = pd.Series(["JAINISH","SIDDHESH","KUNWAR","KAPIL"],index=["Cyber Security","Game Designer","VFX Artist",
                                                                 "Software Developer"])
count

Cyber Security         JAINISH
Game Designer         SIDDHESH
VFX Artist              KUNWAR
Software Developer       KAPIL
dtype: object

In [7]:
#These labels can be used to refer to the values in the Series.
count["Cyber Security"]

'JAINISH'

In [8]:
# Select the entries whose index label ends with 'r', using a list of booleans as a mask
count[[name.endswith('r') for name in count.index]]

Game Designer         SIDDHESH
Software Developer       KAPIL
dtype: object

In [9]:
# The boolean mask on its own: True where the index label ends with 'r'
[name.endswith('r') for name in count.index]

[False, True, False, True]
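
The same mask can probably be built without a list comprehension by using the string methods that pandas exposes on an index through `.str`. This is a sketch of that alternative, not the approach used in the cells above; `mask` and `ends_with_r` are names introduced here.

```python
import pandas as pd

count = pd.Series(
    ["JAINISH", "SIDDHESH", "KUNWAR", "KAPIL"],
    index=["Cyber Security", "Game Designer", "VFX Artist", "Software Developer"],
)

# Index.str.endswith tests every label at once and returns one boolean per label.
mask = count.index.str.endswith("r")

# Indexing with the mask keeps only the entries where it is True.
ends_with_r = count[mask]
print(ends_with_r)
```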

In [10]:
# Positional indexing is still available through .iloc (plain count[0] on a labeled index is deprecated in recent pandas versions)
count.iloc[0]

'JAINISH'

In [11]:
# Negative positions count from the end
count.iloc[-1]

'KAPIL'
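
To keep label-based and position-based access clearly apart, pandas offers `.loc` and `.iloc`. The sketch below rebuilds `count` so that it is self-contained; `first`, `last`, and `by_label` are illustrative names.

```python
import pandas as pd

count = pd.Series(
    ["JAINISH", "SIDDHESH", "KUNWAR", "KAPIL"],
    index=["Cyber Security", "Game Designer", "VFX Artist", "Software Developer"],
)

# .iloc always interprets its argument as a position.
first = count.iloc[0]
last = count.iloc[-1]

# .loc always interprets its argument as an index label.
by_label = count.loc["VFX Artist"]

print(first, last, by_label)
```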

In [12]:
# The index of a Series can be given a name
count.index.name = 'Careers'
count

Careers
Cyber Security         JAINISH
Game Designer         SIDDHESH
VFX Artist              KUNWAR
Software Developer       KAPIL
dtype: object

In [13]:
# A Series can also be built from an existing Python list, here a list of booleans
registrations = [True, False, False, True, True]
pd.Series(registrations)

0     True
1    False
2    False
3     True
4     True
dtype: bool

In [14]:
# A Series can be built from a dictionary: the keys become the index and the values become the data
dictionary = {"Harrison" : "Cyber Security",
             "Sam":"Assistant"}
pd.Series(dictionary)

Harrison    Cyber Security
Sam              Assistant
dtype: object
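
When a dictionary is combined with an explicit `index`, the resulting Series follows the order of that index, and labels without a dictionary entry are filled with `NaN`. A small sketch, in which the label `"Alex"` is hypothetical and added only to show the missing-value case:

```python
import pandas as pd

dictionary = {"Harrison": "Cyber Security", "Sam": "Assistant"}

# "Alex" is a hypothetical label with no matching key in the dictionary.
careers = pd.Series(dictionary, index=["Sam", "Harrison", "Alex"])
print(careers)

# isna() marks the entries that received no value.
missing = careers.isna()
print(missing)
```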

In [15]:
# Check the data type of the Series ('O' stands for object, which is how these strings are stored here)
count.dtype

dtype('O')

In [16]:
# Math operations: a list that mixes integers and floats produces a float64 Series
numeric = pd.Series([10,20,405,1020,23,45,99.9])
numeric

0      10.0
1      20.0
2     405.0
3    1020.0
4      23.0
5      45.0
6      99.9
dtype: float64

In [17]:
#Sum 
numeric.sum()

1622.9

In [18]:
#Mean
numeric.mean()

231.84285714285716

In [19]:
#Median
numeric.median()

45.0

In [20]:
#NumPy's math functions and other operations can be applied to Series without losing the data structure.
np.log(numeric)

0    2.302585
1    2.995732
2    6.003887
3    6.927558
4    3.135494
5    3.806662
6    4.604170
dtype: float64
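
The point about keeping the data structure is easier to see with a labeled index. The following sketch uses a small hypothetical Series (`areas`) that is not part of the notebook's data:

```python
import numpy as np
import pandas as pd

# Hypothetical labeled data, used only for illustration.
areas = pd.Series([4.0, 9.0, 16.0], index=["a", "b", "c"])

# A NumPy function applied to a Series returns a Series with the same index.
sides = np.sqrt(areas)

# Arithmetic with a scalar is applied element by element and also keeps the index.
doubled = areas * 2

print(sides)
print(doubled)
```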

In [21]:
# The data and the index can also be passed as two separate lists
fruits = ["Apple", "Orange", "Plum", "Grape", "Blueberry"]
weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
pd.Series(fruits,weekdays)

Monday           Apple
Tuesday         Orange
Wednesday         Plum
Thursday         Grape
Friday       Blueberry
dtype: object

In [22]:
pd.Series(index=weekdays,data=fruits)

Monday           Apple
Tuesday         Orange
Wednesday         Plum
Thursday         Grape
Friday       Blueberry
dtype: object

In [23]:
numeric

0      10.0
1      20.0
2     405.0
3    1020.0
4      23.0
5      45.0
6      99.9
dtype: float64

In [24]:
counts

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [25]:
# Adding two Series aligns them by index label; labels present in only one of them give NaN
numeric + counts

0      20.0
1      40.0
2     435.0
3    1060.0
4      73.0
5       NaN
6       NaN
dtype: float64
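
If the `NaN` entries are not wanted, one option is the `add` method with `fill_value`, which treats a label missing from one side as the given value. A minimal sketch using the same two Series; `aligned` and `filled` are names introduced here.

```python
import pandas as pd

numeric = pd.Series([10, 20, 405, 1020, 23, 45, 99.9])
counts = pd.Series([10, 20, 30, 40, 50])

# The + operator aligns on the index: labels 5 and 6 exist only in numeric.
aligned = numeric + counts

# add() with fill_value=0 substitutes 0 for the side that lacks a label.
filled = numeric.add(counts, fill_value=0)

print(aligned)
print(filled)
```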

## DataFrame
Inevitably, we want to be able to store, view and manipulate data that is multivariate, where for every index there are multiple fields or columns of data, often of varying data type.

A DataFrame is a tabular data structure, encapsulating multiple series like columns in a spreadsheet. Data are stored internally as a 2-dimensional object, but the DataFrame allows us to represent and manipulate higher-dimensional data.

In [26]:
data = pd.DataFrame({'salary':[70000,100000,60000,50000,85000],
                    'year':[1,1,1,1,1],
                    'career':["Cyber Security","CEO","Game Designer","Business Analysis","Software Developer"],
                    'person':["Jainish","Sam","Siddhesh","Ruchi","Kapil"]})
data

   salary  year              career    person
0   70000     1      Cyber Security   Jainish
1  100000     1                 CEO       Sam
2   60000     1       Game Designer  Siddhesh
3   50000     1   Business Analysis     Ruchi
4   85000     1  Software Developer     Kapil


In [27]:
# The columns appear in the order in which the dictionary supplied them. We can change the order by indexing with a list of column names:
data[['person','career','year','salary']]

     person              career  year  salary
0   Jainish      Cyber Security     1   70000
1       Sam                 CEO     1  100000
2  Siddhesh       Game Designer     1   60000
3     Ruchi   Business Analysis     1   50000
4     Kapil  Software Developer     1   85000


In [28]:
#DataFrame has a second index, representing the columns:
data.columns

Index(['salary', 'year', 'career', 'person'], dtype='object')

In [29]:
# To access a column, use either dict-like indexing (data['salary']) or attribute access (data.salary):
data['salary']

0     70000
1    100000
2     60000
3     50000
4     85000
Name: salary, dtype: int64
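
The two access styles can be compared directly. This sketch uses a reduced version of the DataFrame so that it runs on its own; it also shows that a list of names returns a DataFrame rather than a Series. The names `by_key`, `by_attr`, `same`, and `subset` are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "salary": [70000, 100000, 60000, 50000, 85000],
    "person": ["Jainish", "Sam", "Siddhesh", "Ruchi", "Kapil"],
})

# Dict-like indexing and attribute access return the same column.
by_key = data["salary"]
by_attr = data.salary
same = by_key.equals(by_attr)

# A list of column names returns a DataFrame, even for a single column.
subset = data[["salary"]]

print(same, type(by_key).__name__, type(subset).__name__)
```

Attribute access only works for column names that are valid Python identifiers and do not collide with existing DataFrame attributes, so dict-like indexing is the more general form.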

In [30]:
# To access a row of a DataFrame by its index label, use .loc (the older .ix indexer has been removed from pandas):
data.loc[1]

salary    100000
year           1
career       CEO
person       Sam
Name: 1, dtype: object
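
With the default integer index, labels and positions coincide, which hides the difference between `.loc` and `.iloc`. The sketch below makes it visible by moving the person names into the index; that change is an assumption made for this example and is not done in the notebook.

```python
import pandas as pd

# Assumption for illustration: person names are used as the index.
data = pd.DataFrame(
    {
        "salary": [70000, 100000, 60000],
        "career": ["Cyber Security", "CEO", "Game Designer"],
    },
    index=["Jainish", "Sam", "Siddhesh"],
)

# .loc selects a row by its index label.
by_label = data.loc["Sam"]

# .iloc selects a row by its integer position.
by_position = data.iloc[1]

print(by_label)
```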

In [31]:
# To change someone's salary, first take the salary column
income = data.salary
income

0     70000
1    100000
2     60000
3     50000
4     85000
Name: salary, dtype: int64

In [34]:
# In the pandas version used here this also changes the original data, because income shares its values with data
# pandas warns about this pattern:
income[3] = 0
income

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


0     70000
1    100000
2     60000
3         0
4     85000
Name: salary, dtype: int64

In [35]:
data

   salary  year              career    person
0   70000     1      Cyber Security   Jainish
1  100000     1                 CEO       Sam
2   60000     1       Game Designer  Siddhesh
3       0     1   Business Analysis     Ruchi
4   85000     1  Software Developer     Kapil
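
Whether a change to an extracted column reaches the original DataFrame depends on the pandas version, so it is usually safer to state the intent explicitly. The following sketch shows two explicit patterns under that assumption: `.copy()` for an independent Series and a single `.loc` call for writing into the DataFrame. It uses a reduced DataFrame with only the salary column.

```python
import pandas as pd

data = pd.DataFrame({"salary": [70000, 100000, 60000, 50000, 85000]})

# An explicit copy is independent of the DataFrame it came from.
income = data["salary"].copy()
income[3] = 0

# To change the DataFrame itself, address the row and column in one .loc call.
data.loc[2, "salary"] = 50000

print(income)
print(data)
```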


In [37]:
# We can modify values in a column by assignment; a single .loc call avoids the chained-assignment warning:
data.loc[2, 'salary'] = 50000
data

   salary  year              career    person
0   70000     1      Cyber Security   Jainish
1  100000     1                 CEO       Sam
2   50000     1       Game Designer  Siddhesh
3       0     1   Business Analysis     Ruchi
4   85000     1  Software Developer     Kapil


In [38]:
# Assigning a single value to a new column name adds the column and repeats the value in every row
data['current year'] = 2019
data

   salary  year              career    person  current year
0   70000     1      Cyber Security   Jainish          2019
1  100000     1                 CEO       Sam          2019
2   50000     1       Game Designer  Siddhesh          2019
3       0     1   Business Analysis     Ruchi          2019
4   85000     1  Software Developer     Kapil          2019


In [39]:
# Delete a column with del
del data['current year']

In [40]:
data

   salary  year              career    person
0   70000     1      Cyber Security   Jainish
1  100000     1                 CEO       Sam
2   50000     1       Game Designer  Siddhesh
3       0     1   Business Analysis     Ruchi
4   85000     1  Software Developer     Kapil


In [42]:
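# Attribute assignment does not create a new column; it only sets a Python attribute on the object, so the table is unchanged
data.birthyear = 2019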
data

   salary  year              career    person
0   70000     1      Cyber Security   Jainish
1  100000     1                 CEO       Sam
2   50000     1       Game Designer  Siddhesh
3       0     1   Business Analysis     Ruchi
4   85000     1  Software Developer     Kapil
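
The difference between the two assignment styles can be checked by looking at `data.columns`. A minimal sketch on a two-row DataFrame; the flag names `has_column_after_attr` and `has_column_after_key` are introduced here for illustration.

```python
import pandas as pd

data = pd.DataFrame({"salary": [70000, 100000]})

# Attribute assignment with a new name only sets a Python attribute.
data.birthyear = 2019
has_column_after_attr = "birthyear" in data.columns

# Dict-like assignment is what actually creates a column.
data["birthyear"] = [1999, 2000]
has_column_after_key = "birthyear" in data.columns

print(has_column_after_attr, has_column_after_key)
```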


In [49]:
# To add a column from existing data, assign a Series to a new column name.
# The values of the Series are matched to the rows of the DataFrame by index label:
birthyear = pd.Series([1999]*2 + [2000]*2 + [1990]*1)

In [50]:
data['birthyear'] = birthyear
data

   salary  year              career    person  birthyear
0   70000     1      Cyber Security   Jainish       1999
1  100000     1                 CEO       Sam       1999
2   50000     1       Game Designer  Siddhesh       2000
3       0     1   Business Analysis     Ruchi       2000
4   85000     1  Software Developer     Kapil       1990
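
Because the values are matched by index label rather than by position, a Series whose index only partly overlaps the DataFrame leaves gaps. The sketch below uses a hypothetical three-row DataFrame and a Series indexed 1 to 3 to show this:

```python
import pandas as pd

data = pd.DataFrame({"person": ["Jainish", "Sam", "Siddhesh"]})

# Hypothetical Series whose index covers the labels 1, 2 and 3 only.
birthyear = pd.Series([1999, 2000, 1990], index=[1, 2, 3])

# Row 0 has no matching label and becomes NaN; label 3 has no row and is dropped.
data["birthyear"] = birthyear

print(data)
```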


