![title](http://tripkendall.com/wp-content/uploads/2018/01/pandas_logo-1080x675.jpg)

## Pandas library 

Pandas is an open source library that provides easy-to-use data structures and data analysis tools for Python. Its most widely used data structure is the DataFrame.

# Pandas Series

A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just an integer position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.
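As a minimal sketch of the two points above (label-based indexing and non-numeric contents), the labels and values below are made up for illustration, and `mixed` is just a name chosen here:

```python
import pandas as pd

# A Series with string labels; the values do not have to be numeric.
mixed = pd.Series([3.14, 'text', len], index=['number', 'string', 'function'])

by_label = mixed['string']          # look up by label
by_position = mixed.iloc[0]         # look up by integer position
called = mixed['function']('abc')   # the stored object is the built-in len
```

Because the values are of different types, pandas stores them with the generic `object` dtype.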

In [309]:
import pandas as pd
import numpy as np

### Creating a Series

You can convert a list, a NumPy array, or a dictionary to a Series:

In [310]:
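#note: the names list and dict shadow Python's built-ins; they are kept here because the cells below use them
list=[1,2,3,4]
labels=['a','b','c','d']
array=np.array(list)
dict={'a':1,'b':2,'c':3,'d':4}
print(list)
print(labels)
print(array)
print(dict)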

[1, 2, 3, 4]
['a', 'b', 'c', 'd']
[1 2 3 4]
{'a': 1, 'b': 2, 'c': 3, 'd': 4}


A pandas Series can hold a variety of object types:

In [311]:
pd.Series(data=list)

0    1
1    2
2    3
3    4
dtype: int64

In [312]:
pd.Series(data=list,index=labels)

a    1
b    2
c    3
d    4
dtype: int64

In [313]:
pd.Series(data=labels,index=list)

1    a
2    b
3    c
4    d
dtype: object

In [314]:
pd.Series(array,index=labels)

a    1
b    2
c    3
d    4
dtype: int32

In [315]:
pd.Series(dict)

a    1
b    2
c    3
d    4
dtype: int64

### Indexing in series 

In [316]:
ser1 = pd.Series([80,20,90,40],index = ['USA', 'Germany','India', 'Japan'])     
ser1

USA        80
Germany    20
India      90
Japan      40
dtype: int64

In [317]:
ser2 = pd.Series([100,2,5,4],index = ['USA', 'Germany','India', 'Korea'])
ser2

USA        100
Germany      2
India        5
Korea        4
dtype: int64

In [318]:
ser1['India']


90

In [319]:
ser2['USA']

100

### pandas series operations

In [320]:
ser1+ser2

Germany     22.0
India       95.0
Japan        NaN
Korea        NaN
USA        180.0
dtype: float64
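The NaN values above appear because arithmetic between two Series aligns on the labels, and 'Japan' and 'Korea' each exist in only one of them. A small sketch of one way to handle this, assuming a missing label should count as 0 (that choice is an assumption, not something the notebook prescribes):

```python
import pandas as pd

ser1 = pd.Series([80, 20, 90, 40], index=['USA', 'Germany', 'India', 'Japan'])
ser2 = pd.Series([100, 2, 5, 4], index=['USA', 'Germany', 'India', 'Korea'])

# Plain addition leaves NaN wherever a label is missing from one side.
plain = ser1 + ser2

# Series.add with fill_value treats the missing side as 0 instead.
total = ser1.add(ser2, fill_value=0)
```

Here `total` keeps the Japan and Korea values from whichever Series has them, while the shared labels are summed as before.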

In [321]:
print(ser1.index)
print(ser1.values)
pd.isnull(ser1)

Index(['USA', 'Germany', 'India', 'Japan'], dtype='object')
[80 20 90 40]


USA        False
Germany    False
India      False
Japan      False
dtype: bool

In [322]:
ser1[ser1>20]

USA      80
India    90
Japan    40
dtype: int64

In [323]:
'India' in ser1

True

In [324]:
#converting series to dict
dictser1=ser1.to_dict()
dictser1

{'Germany': 20, 'India': 90, 'Japan': 40, 'USA': 80}

# Pandas DataFrames 

You can think of a DataFrame as a bunch of Series objects put together to share the same index.

According to the pandas documentation (pydata.org), a DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. It is similar to a spreadsheet, a SQL table, or a dict of Series objects.

![title](https://s3.amazonaws.com/rfv2/dataframe.png)

Use pd.DataFrame to create a DataFrame df. Since a DataFrame is a collection of Series, it is described by three parts: values, index, and columns.
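To make the "collection of Series" idea concrete, here is a small sketch that builds a DataFrame from two Series sharing an index. The names `w`, `x`, and `frame` and the values are illustrative only:

```python
import pandas as pd

# Two Series with the same index labels.
w = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
x = pd.Series([10.5, 20.5, 30.5], index=['a', 'b', 'c'])

# A dict of Series becomes a DataFrame; the dict keys become column names.
frame = pd.DataFrame({'w': w, 'x': x})

values = frame.values      # 2-D NumPy array holding the data
index = frame.index        # row labels
columns = frame.columns    # column labels
col = frame['w']           # selecting one column gives back a Series
```

Each column keeps its own dtype, which is why a DataFrame can mix integer and float (or string) columns.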

In [325]:
import pandas as pd
import numpy as np
values=np.random.randint(1,100,(5,4)) #generating 20 random integers from 1 to 99 as a 5x4 array
labels=['a','b','c','d','e']
columns=['w','x','y','z']
df=pd.DataFrame(data=values,index=labels,columns=columns)
#the shape of values must be (number of index labels, number of columns)

In [326]:
df

Unnamed: 0,w,x,y,z
a,63,11,3,31
b,16,20,29,49
c,17,94,17,47
d,74,22,97,88
e,97,27,98,6


## indexing in dataframes (selecting columns)

In [327]:
#Selecting column
df['w']

a    63
b    16
c    17
d    74
e    97
Name: w, dtype: int32

In [328]:
#selecting multiple columns
df[['w','y']]

Unnamed: 0,w,y
a,63,3
b,16,29
c,17,17
d,74,97
e,97,98


In [329]:
#creating new column
df['new']=df['x'] + df['y']
df

Unnamed: 0,w,x,y,z,new
a,63,11,3,31,14
b,16,20,29,49,49
c,17,94,17,47,111
d,74,22,97,88,119
e,97,27,98,6,125


### Removing or dropping a column or row


In [330]:
df

Unnamed: 0,w,x,y,z,new
a,63,11,3,31,14
b,16,20,29,49,49
c,17,94,17,47,111
d,74,22,97,88,119
e,97,27,98,6,125


In [331]:
#dropping column 'new'
df.drop('new',axis=1,inplace=True) #axis=1 means columns; inplace=True modifies df itself instead of returning a new DataFrame


In [332]:
df

Unnamed: 0,w,x,y,z
a,63,11,3,31
b,16,20,29,49
c,17,94,17,47
d,74,22,97,88
e,97,27,98,6


In [333]:
#dropping row 'e'
df.drop('e',axis=0,inplace=True) #axis=0 means rows of the DataFrame


In [334]:
df

Unnamed: 0,w,x,y,z
a,63,11,3,31
b,16,20,29,49
c,17,94,17,47
d,74,22,97,88


In [335]:
df.shape #4 rows and 4 columns

(4, 4)

## Selecting rows in a DataFrame 

Use loc (based on index labels) or iloc (based on integer position) to select rows.

In [336]:
df

Unnamed: 0,w,x,y,z
a,63,11,3,31
b,16,20,29,49
c,17,94,17,47
d,74,22,97,88


In [337]:
#single row(loc)
df.loc['a']

w    63
x    11
y     3
z    31
Name: a, dtype: int32

In [338]:
#multiple rows(loc)
df.loc[['a','b']]

Unnamed: 0,w,x,y,z
a,63,11,3,31
b,16,20,29,49


In [339]:
#single row(iloc)
df.iloc[0]

w    63
x    11
y     3
z    31
Name: a, dtype: int32

In [340]:
#multiple rows(iloc)
df.iloc[[0,1]]


Unnamed: 0,w,x,y,z
a,63,11,3,31
b,16,20,29,49
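One difference between the two that is easy to miss shows up when slicing: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position, as with ordinary Python slicing. A short sketch on a small frame with the same labels as above (only two columns are reproduced here):

```python
import pandas as pd

df = pd.DataFrame({'w': [63, 16, 17, 74], 'x': [11, 20, 94, 22]},
                  index=['a', 'b', 'c', 'd'])

by_label = df.loc['a':'c']     # label slice: rows a, b and c
by_position = df.iloc[0:2]     # position slice: rows at positions 0 and 1 only
```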


### Selecting Subsets of Rows and Columns 

In [341]:
df

Unnamed: 0,w,x,y,z
a,63,11,3,31
b,16,20,29,49
c,17,94,17,47
d,74,22,97,88


In [342]:
#grab the single value at row 'b', column 'x'
df.loc['b','x']

20

In [343]:
df.loc[['a'],['x','z']] # 'a' is the row label; 'x' and 'z' are the columns

Unnamed: 0,x,z
a,11,31


In [344]:
df.loc[['a','b','c'],['x','z']]

Unnamed: 0,x,z
a,11,31
b,20,49
c,94,47


## Conditional selection of df 

In [345]:
df

Unnamed: 0,w,x,y,z
a,63,11,3,31
b,16,20,29,49
c,17,94,17,47
d,74,22,97,88


In [346]:
df[df>50]

Unnamed: 0,w,x,y,z
a,63.0,,,
b,,,,
c,,94.0,,
d,74.0,,97.0,88.0


In [347]:
df[df['x']>50]

Unnamed: 0,w,x,y,z
c,17,94,17,47


In [348]:
df[df['z']>50]

Unnamed: 0,w,x,y,z
d,74,22,97,88


In [349]:
df[df['z']>50]['x']

d    22
Name: x, dtype: int32

In [350]:
df[df['z']>50][['x','y']]

Unnamed: 0,x,y
d,22,97


In [351]:
df[(df['x']>50)&(df['y']<50)]

Unnamed: 0,w,x,y,z
c,17,94,17,47


In [352]:
df[(df['x']>50)|(df['y']<50)]

Unnamed: 0,w,x,y,z
a,63,11,3,31
b,16,20,29,49
c,17,94,17,47


### Setting the index

In [353]:
#adding a new column to use as the new index
df['new']='pr an ee th'.split() #'pr an ee th' ==['pr','an','ee','th']
df

Unnamed: 0,w,x,y,z,new
a,63,11,3,31,pr
b,16,20,29,49,an
c,17,94,17,47,ee
d,74,22,97,88,th


In [354]:
#set the new column as the index
df.set_index('new',inplace=True)
df

Unnamed: 0_level_0,w,x,y,z
new,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
pr,63,11,3,31
an,16,20,29,49
ee,17,94,17,47
th,74,22,97,88
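`set_index` has a counterpart, `reset_index`, which moves the index back into an ordinary column and numbers the rows from 0. A minimal sketch on a two-row frame (the values are taken from the table above; `indexed` and `restored` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'w': [63, 16], 'new': ['pr', 'an']}, index=['a', 'b'])

# Without inplace=True both calls return a new DataFrame and leave df untouched.
indexed = df.set_index('new')       # 'new' becomes the index; labels a, b are discarded
restored = indexed.reset_index()    # the index becomes a column again; rows are 0, 1
```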


### Missing data in the DataFrame

In [355]:
newdf=df[df>40]
newdf

Unnamed: 0_level_0,w,x,y,z
new,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
pr,63.0,,,
an,,,,49.0
ee,,94.0,,47.0
th,74.0,,97.0,88.0


dropna(axis=0) drops every row that contains a missing value (NaN).

In [356]:
newdf.dropna(axis=0)

Unnamed: 0_level_0,w,x,y,z
new,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1


dropna(axis=1) drops every column that contains a missing value.

In [357]:
newdf.dropna(axis=1)

pr
an
ee
th


dropna(thresh=3) keeps only the rows that have at least 3 non-missing values; with 4 columns, any row with 2 or more missing values is dropped.

In [358]:
newdf.dropna(thresh=3)

Unnamed: 0_level_0,w,x,y,z
new,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
th,74.0,,97.0,88.0
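Since `thresh` counts non-missing values rather than missing ones, it can help to look at that count directly. The sketch below rebuilds the frame shown above by hand and compares two thresholds; `non_null_per_row` is a name introduced here:

```python
import numpy as np
import pandas as pd

newdf = pd.DataFrame({'w': [63, np.nan, np.nan, 74],
                      'x': [np.nan, np.nan, 94, np.nan],
                      'y': [np.nan, np.nan, np.nan, 97],
                      'z': [np.nan, 49, 47, 88]},
                     index=['pr', 'an', 'ee', 'th'])

# Number of non-missing values in each row.
non_null_per_row = newdf.notnull().sum(axis=1)

kept_2 = newdf.dropna(thresh=2)   # rows with at least 2 non-missing values
kept_3 = newdf.dropna(thresh=3)   # rows with at least 3 non-missing values
```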


## fillna() is used to fill the missing values

In [359]:
newdf

Unnamed: 0_level_0,w,x,y,z
new,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
pr,63.0,,,
an,,,,49.0
ee,,94.0,,47.0
th,74.0,,97.0,88.0


In [360]:
newdf.fillna('bunny')

Unnamed: 0_level_0,w,x,y,z
new,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
pr,63,bunny,bunny,bunny
an,bunny,bunny,bunny,49
ee,bunny,94,bunny,47
th,74,bunny,97,88


In [361]:
newdf.fillna(value=newdf['w'].mean())

Unnamed: 0_level_0,w,x,y,z
new,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
pr,63.0,68.5,68.5,68.5
an,68.5,68.5,68.5,49.0
ee,68.5,94.0,68.5,47.0
th,74.0,68.5,97.0,88.0
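The call above fills every column with the mean of column w. If each column should instead be filled with its own mean, one possible approach is to pass the Series of column means to `fillna`. A sketch under that assumption, again rebuilding the frame by hand:

```python
import numpy as np
import pandas as pd

newdf = pd.DataFrame({'w': [63, np.nan, np.nan, 74],
                      'x': [np.nan, np.nan, 94, np.nan],
                      'y': [np.nan, np.nan, np.nan, 97],
                      'z': [np.nan, 49, 47, 88]},
                     index=['pr', 'an', 'ee', 'th'])

# newdf.mean() is a Series indexed by column name; fillna matches it per column.
filled = newdf.fillna(newdf.mean())
```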


# GroupBy in pandas DataFrame   

The groupby method allows you to group rows of data together by the values in a column and then call aggregate functions on each group.


In [362]:
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
       'Sales':[200,120,340,124,243,350]}

In [363]:
df1=pd.DataFrame(data)
df1

Unnamed: 0,Company,Person,Sales
0,GOOG,Sam,200
1,GOOG,Charlie,120
2,MSFT,Amy,340
3,MSFT,Vanessa,124
4,FB,Carl,243
5,FB,Sarah,350


In [364]:
df1_company=df1.groupby('Company')
df1_company #a GroupBy object: the rows of df1 grouped by Company; nothing is computed until an aggregate is called

<pandas.core.groupby.DataFrameGroupBy object at 0x0000011ECD1FAC50>

In [365]:
df1_company.mean(numeric_only=True) #numeric_only=True skips the non-numeric Person column

Unnamed: 0_level_0,Sales
Company,Unnamed: 1_level_1
FB,296.5
GOOG,160.0
MSFT,232.0


In [366]:
df1_company.sum(numeric_only=True) #numeric_only=True skips the non-numeric Person column

Unnamed: 0_level_0,Sales
Company,Unnamed: 1_level_1
FB,593
GOOG,320
MSFT,464


In [367]:
df1_company.max()

Unnamed: 0_level_0,Person,Sales
Company,Unnamed: 1_level_1,Unnamed: 2_level_1
FB,Sarah,350
GOOG,Sam,200
MSFT,Vanessa,340


In [368]:
df1_company.count()

Unnamed: 0_level_0,Person,Sales
Company,Unnamed: 1_level_1,Unnamed: 2_level_1
FB,2,2
GOOG,2,2
MSFT,2,2


In [369]:
df1_company.describe()

Unnamed: 0_level_0,Sales,Sales,Sales,Sales,Sales,Sales,Sales,Sales
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
Company,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
FB,2.0,296.5,75.660426,243.0,269.75,296.5,323.25,350.0
GOOG,2.0,160.0,56.568542,120.0,140.0,160.0,180.0,200.0
MSFT,2.0,232.0,152.735065,124.0,178.0,232.0,286.0,340.0


In [370]:
df1_company.describe().transpose()

Unnamed: 0,Company,FB,GOOG,MSFT
Sales,count,2.0,2.0,2.0
Sales,mean,296.5,160.0,232.0
Sales,std,75.660426,56.568542,152.735065
Sales,min,243.0,120.0,124.0
Sales,25%,269.75,140.0,178.0
Sales,50%,296.5,160.0,232.0
Sales,75%,323.25,180.0,286.0
Sales,max,350.0,200.0,340.0
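When only a few statistics are needed, selecting the column first and passing a list of function names to `agg` is a compact alternative to `describe()`. A minimal sketch using the same sales data (`sales_stats` is a name chosen here):

```python
import pandas as pd

df1 = pd.DataFrame({'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
                    'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
                    'Sales': [200, 120, 340, 124, 243, 350]})

# Group by company, keep only the Sales column, compute three aggregates at once.
sales_stats = df1.groupby('Company')['Sales'].agg(['min', 'max', 'mean'])
```

The result has one row per company and one column per requested aggregate.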


# Merging, joining and concatenating DataFrames

In [371]:
adf=df[df['z']>50]
adf

Unnamed: 0_level_0,w,x,y,z
new,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
th,74,22,97,88


In [372]:
bdf=df[df['x']>50]
bdf

Unnamed: 0_level_0,w,x,y,z
new,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ee,17,94,17,47


### Concatenation basically glues together DataFrames. Keep in mind that dimensions should match along the axis you are concatenating on. You can use pd.concat and pass in a list of DataFrames to concatenate together:



In [373]:
#concatenation
pd.concat([adf,bdf])

Unnamed: 0_level_0,w,x,y,z
new,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
th,74,22,97,88
ee,17,94,17,47


In [374]:
#axis=1
pd.concat([adf,bdf],axis=1)

Unnamed: 0,w,x,y,z,w.1,x.1,y.1,z.1
ee,,,,,17.0,94.0,17.0,47.0
th,74.0,22.0,97.0,88.0,,,,
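Two optional arguments of `pd.concat` are often useful here: `ignore_index=True` renumbers the rows of the result, and `keys` labels which input each block came from. The sketch below uses two one-row frames with values from above; the names `a` and `b` are illustrative:

```python
import pandas as pd

a = pd.DataFrame({'w': [74], 'x': [22]}, index=['th'])
b = pd.DataFrame({'w': [17], 'x': [94]}, index=['ee'])

stacked = pd.concat([a, b])                                  # rows stacked, labels kept
renumbered = pd.concat([a, b], ignore_index=True)            # rows relabeled 0, 1
side_by_side = pd.concat([a, b], axis=1, keys=['a', 'b'])    # columns tagged by source
```

With `axis=1` the rows are aligned on the index, so a row label that exists in only one input gets NaN in the other input's columns.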


### The merge function allows you to merge DataFrames together using a similar logic as merging SQL Tables together. For example: 

In [375]:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                        'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3']})
    
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                               'key2': ['K0', 'K0', 'K0', 'K0'],
                                  'C': ['C0', 'C1', 'C2', 'C3'],
                                  'D': ['D0', 'D1', 'D2', 'D3']})

In [376]:
left

Unnamed: 0,A,B,key1,key2
0,A0,B0,K0,K0
1,A1,B1,K0,K1
2,A2,B2,K1,K0
3,A3,B3,K2,K1


In [377]:
right

Unnamed: 0,C,D,key1,key2
0,C0,D0,K0,K0
1,C1,D1,K1,K0
2,C2,D2,K1,K0
3,C3,D3,K2,K0


In [378]:
pd.merge(left, right, on=['key1', 'key2'])

Unnamed: 0,A,B,key1,key2,C,D
0,A0,B0,K0,K0,C0,D0
1,A2,B2,K1,K0,C1,D1
2,A2,B2,K1,K0,C2,D2


In [379]:
pd.merge(left, right, how='right', on=['key1', 'key2'])

Unnamed: 0,A,B,key1,key2,C,D
0,A0,B0,K0,K0,C0,D0
1,A2,B2,K1,K0,C1,D1
2,A2,B2,K1,K0,C2,D2
3,,,K2,K0,C3,D3


In [380]:
pd.merge(left, right, how='outer', on=['key1', 'key2'])

Unnamed: 0,A,B,key1,key2,C,D
0,A0,B0,K0,K0,C0,D0
1,A1,B1,K0,K1,,
2,A2,B2,K1,K0,C1,D1
3,A2,B2,K1,K0,C2,D2
4,A3,B3,K2,K1,,
5,,,K2,K0,C3,D3
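To see where each row of a merged result came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. A small sketch with a single key column (these simplified frames are made up and are not the ones used above):

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K0', 'K2', 'K3'], 'C': ['C0', 'C2', 'C3']})

# The default how='inner' keeps only keys present in both frames.
inner = pd.merge(left, right, on='key')

# how='outer' keeps every key; _merge records which side each row came from.
outer = pd.merge(left, right, on='key', how='outer', indicator=True)
```

In `outer`, the `_merge` value is `both`, `left_only`, or `right_only` for each row.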


### Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame 

In [381]:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                      index=['K0', 'K1', 'K2']) 

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                    'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])

In [382]:
left

Unnamed: 0,A,B
K0,A0,B0
K1,A1,B1
K2,A2,B2


In [383]:
right

Unnamed: 0,C,D
K0,C0,D0
K2,C2,D2
K3,C3,D3


In [384]:
left.join(right)

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2


In [385]:
left.join(right, how='outer')

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2
K3,,,C3,D3


# Operations on dataframes 

In [386]:
data = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
data.head()
#head() shows the first 5 rows (this DataFrame has only 4)

Unnamed: 0,col1,col2,col3
0,1,444,abc
1,2,555,def
2,3,666,ghi
3,4,444,xyz


In [387]:
data['col2'].unique()
#show unique values

array([444, 555, 666], dtype=int64)

In [388]:
data['col2'].nunique()
#shows the number of unique values

3

In [389]:
data['col2'].value_counts()
#counts how many times each value occurs

444    2
555    1
666    1
Name: col2, dtype: int64

# Applying our own functions to DataFrame columns (.apply())

In [390]:
data['col2'].values

array([444, 555, 666, 444], dtype=int64)

In [391]:
def times2(x):
    return x*2

In [392]:
data['col2'].apply(times2)

0     888
1    1110
2    1332
3     888
Name: col2, dtype: int64
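`apply` also accepts a lambda or a built-in function, and for simple arithmetic the same result can be obtained with a plain vectorized expression. A brief sketch on the same columns (the result names are introduced here):

```python
import pandas as pd

data = pd.DataFrame({'col2': [444, 555, 666, 444],
                     'col3': ['abc', 'def', 'ghi', 'xyz']})

doubled_apply = data['col2'].apply(lambda x: x * 2)   # function called once per element
doubled_vectorized = data['col2'] * 2                 # same values, computed on the whole column
lengths = data['col3'].apply(len)                     # a built-in works as well
```

The vectorized form is usually the better choice when it exists; `apply` is most useful for logic that cannot be written as column arithmetic.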

# Data input and output 

## CSV


In [393]:
csvdf = pd.read_csv('https://raw.githubusercontent.com/colaberry/538data/master/college-majors/grad-students.csv')
csvdf

Unnamed: 0,Major_code,Major,Major_category,Grad_total,Grad_sample_size,Grad_employed,Grad_full_time_year_round,Grad_unemployed,Grad_unemployment_rate,Grad_median,...,Nongrad_total,Nongrad_employed,Nongrad_full_time_year_round,Nongrad_unemployed,Nongrad_unemployment_rate,Nongrad_median,Nongrad_P25,Nongrad_P75,Grad_share,Grad_premium
0,5601,CONSTRUCTION SERVICES,Industrial Arts & Consumer Services,9173,200,7098,6511,681,0.087543,75000.0,...,86062,73607,62435,3928,0.050661,65000.0,47000,98000.0,0.096320,0.153846
1,6004,COMMERCIAL ART AND GRAPHIC DESIGN,Arts,53864,882,40492,29553,2482,0.057756,60000.0,...,461977,347166,250596,25484,0.068386,48000.0,34000,71000.0,0.104420,0.250000
2,6211,HOSPITALITY MANAGEMENT,Business,24417,437,18368,14784,1465,0.073867,65000.0,...,179335,145597,113579,7409,0.048423,50000.0,35000,75000.0,0.119837,0.300000
3,2201,COSMETOLOGY SERVICES AND CULINARY ARTS,Industrial Arts & Consumer Services,5411,72,3590,2701,316,0.080901,47000.0,...,37575,29738,23249,1661,0.052900,41600.0,29000,60000.0,0.125878,0.129808
4,2001,COMMUNICATION TECHNOLOGIES,Computers & Mathematics,9109,171,7512,5622,466,0.058411,57000.0,...,53819,43163,34231,3389,0.072800,52000.0,36000,78000.0,0.144753,0.096154
5,3201,COURT REPORTING,Law & Public Policy,1542,22,1008,860,0,0.000000,75000.0,...,8921,6967,6063,518,0.069205,50000.0,34000,75000.0,0.147376,0.500000
6,6206,MARKETING AND MARKETING RESEARCH,Business,190996,3738,151570,123045,8324,0.052059,80000.0,...,1029181,817906,662346,45519,0.052719,60000.0,40000,91500.0,0.156531,0.333333
7,1101,AGRICULTURE PRODUCTION AND MANAGEMENT,Agriculture & Natural Resources,17488,386,13104,11207,473,0.034838,67000.0,...,89169,71781,61335,1869,0.025377,55000.0,38000,80000.0,0.163965,0.218182
8,2101,COMPUTER PROGRAMMING AND DATA PROCESSING,Computers & Mathematics,5611,98,4716,3981,119,0.024612,85000.0,...,28314,22024,18381,2222,0.091644,60000.0,40000,85000.0,0.165394,0.416667
9,1904,ADVERTISING AND PUBLIC RELATIONS,Communications & Journalism,33928,688,28517,22523,899,0.030562,60000.0,...,163435,127832,100330,8706,0.063762,51000.0,37800,78000.0,0.171907,0.176471


In [394]:
df

Unnamed: 0_level_0,w,x,y,z
new,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
pr,63,11,3,31
an,16,20,29,49
ee,17,94,17,47
th,74,22,97,88


### converting df to csv with df.to_csv() 

In [395]:
df.to_csv(index=False)

'w,x,y,z\n63,11,3,31\n16,20,29,49\n17,94,17,47\n74,22,97,88\n'
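Note that `index=False` drops the index labels (pr, an, ee, th) from the output. If the index should survive a round trip, one option is to write it and then name it as the index column when reading back. A sketch using an in-memory buffer instead of a file (the two-row frame is a cut-down copy of df):

```python
import io
import pandas as pd

df = pd.DataFrame({'w': [63, 16], 'x': [11, 20]}, index=['pr', 'an'])
df.index.name = 'new'

# With no path argument, to_csv returns the CSV text; the index is the first column.
csv_text = df.to_csv()

# Read the text back and tell pandas which column holds the index.
round_trip = pd.read_csv(io.StringIO(csv_text), index_col='new')
```

Passing a filename as the first argument to `to_csv` writes to disk instead of returning a string.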

### converting df to excel with df.to_excel() 


## HTML

You may need to install html5lib, lxml, and BeautifulSoup4. In your terminal/command prompt run:

    conda install lxml
    conda install html5lib
    conda install BeautifulSoup4

Then restart Jupyter Notebook.
(or use pip install if you aren't using the Anaconda Distribution)

Pandas can read the tables on an HTML page with pd.read_html(), which returns a list of DataFrames.

# Example:1 (DataFrame)  

In [398]:
import pandas as pd
import numpy as np
data=pd.read_csv('https://raw.githubusercontent.com/colaberry/538data/master/college-majors/grad-students.csv')

In [399]:
data

Unnamed: 0,Major_code,Major,Major_category,Grad_total,Grad_sample_size,Grad_employed,Grad_full_time_year_round,Grad_unemployed,Grad_unemployment_rate,Grad_median,...,Nongrad_total,Nongrad_employed,Nongrad_full_time_year_round,Nongrad_unemployed,Nongrad_unemployment_rate,Nongrad_median,Nongrad_P25,Nongrad_P75,Grad_share,Grad_premium
0,5601,CONSTRUCTION SERVICES,Industrial Arts & Consumer Services,9173,200,7098,6511,681,0.087543,75000.0,...,86062,73607,62435,3928,0.050661,65000.0,47000,98000.0,0.096320,0.153846
1,6004,COMMERCIAL ART AND GRAPHIC DESIGN,Arts,53864,882,40492,29553,2482,0.057756,60000.0,...,461977,347166,250596,25484,0.068386,48000.0,34000,71000.0,0.104420,0.250000
2,6211,HOSPITALITY MANAGEMENT,Business,24417,437,18368,14784,1465,0.073867,65000.0,...,179335,145597,113579,7409,0.048423,50000.0,35000,75000.0,0.119837,0.300000
3,2201,COSMETOLOGY SERVICES AND CULINARY ARTS,Industrial Arts & Consumer Services,5411,72,3590,2701,316,0.080901,47000.0,...,37575,29738,23249,1661,0.052900,41600.0,29000,60000.0,0.125878,0.129808
4,2001,COMMUNICATION TECHNOLOGIES,Computers & Mathematics,9109,171,7512,5622,466,0.058411,57000.0,...,53819,43163,34231,3389,0.072800,52000.0,36000,78000.0,0.144753,0.096154
5,3201,COURT REPORTING,Law & Public Policy,1542,22,1008,860,0,0.000000,75000.0,...,8921,6967,6063,518,0.069205,50000.0,34000,75000.0,0.147376,0.500000
6,6206,MARKETING AND MARKETING RESEARCH,Business,190996,3738,151570,123045,8324,0.052059,80000.0,...,1029181,817906,662346,45519,0.052719,60000.0,40000,91500.0,0.156531,0.333333
7,1101,AGRICULTURE PRODUCTION AND MANAGEMENT,Agriculture & Natural Resources,17488,386,13104,11207,473,0.034838,67000.0,...,89169,71781,61335,1869,0.025377,55000.0,38000,80000.0,0.163965,0.218182
8,2101,COMPUTER PROGRAMMING AND DATA PROCESSING,Computers & Mathematics,5611,98,4716,3981,119,0.024612,85000.0,...,28314,22024,18381,2222,0.091644,60000.0,40000,85000.0,0.165394,0.416667
9,1904,ADVERTISING AND PUBLIC RELATIONS,Communications & Journalism,33928,688,28517,22523,899,0.030562,60000.0,...,163435,127832,100330,8706,0.063762,51000.0,37800,78000.0,0.171907,0.176471


In [400]:
data.head()

Unnamed: 0,Major_code,Major,Major_category,Grad_total,Grad_sample_size,Grad_employed,Grad_full_time_year_round,Grad_unemployed,Grad_unemployment_rate,Grad_median,...,Nongrad_total,Nongrad_employed,Nongrad_full_time_year_round,Nongrad_unemployed,Nongrad_unemployment_rate,Nongrad_median,Nongrad_P25,Nongrad_P75,Grad_share,Grad_premium
0,5601,CONSTRUCTION SERVICES,Industrial Arts & Consumer Services,9173,200,7098,6511,681,0.087543,75000.0,...,86062,73607,62435,3928,0.050661,65000.0,47000,98000.0,0.09632,0.153846
1,6004,COMMERCIAL ART AND GRAPHIC DESIGN,Arts,53864,882,40492,29553,2482,0.057756,60000.0,...,461977,347166,250596,25484,0.068386,48000.0,34000,71000.0,0.10442,0.25
2,6211,HOSPITALITY MANAGEMENT,Business,24417,437,18368,14784,1465,0.073867,65000.0,...,179335,145597,113579,7409,0.048423,50000.0,35000,75000.0,0.119837,0.3
3,2201,COSMETOLOGY SERVICES AND CULINARY ARTS,Industrial Arts & Consumer Services,5411,72,3590,2701,316,0.080901,47000.0,...,37575,29738,23249,1661,0.0529,41600.0,29000,60000.0,0.125878,0.129808
4,2001,COMMUNICATION TECHNOLOGIES,Computers & Mathematics,9109,171,7512,5622,466,0.058411,57000.0,...,53819,43163,34231,3389,0.0728,52000.0,36000,78000.0,0.144753,0.096154


In [401]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 22 columns):
Major_code                      173 non-null int64
Major                           173 non-null object
Major_category                  173 non-null object
Grad_total                      173 non-null int64
Grad_sample_size                173 non-null int64
Grad_employed                   173 non-null int64
Grad_full_time_year_round       173 non-null int64
Grad_unemployed                 173 non-null int64
Grad_unemployment_rate          173 non-null float64
Grad_median                     173 non-null float64
Grad_P25                        173 non-null int64
Grad_P75                        173 non-null float64
Nongrad_total                   173 non-null int64
Nongrad_employed                173 non-null int64
Nongrad_full_time_year_round    173 non-null int64
Nongrad_unemployed              173 non-null int64
Nongrad_unemployment_rate       173 non-null float64
Nongrad_median    

In [402]:
data.describe()

Unnamed: 0,Major_code,Grad_total,Grad_sample_size,Grad_employed,Grad_full_time_year_round,Grad_unemployed,Grad_unemployment_rate,Grad_median,Grad_P25,Grad_P75,Nongrad_total,Nongrad_employed,Nongrad_full_time_year_round,Nongrad_unemployed,Nongrad_unemployment_rate,Nongrad_median,Nongrad_P25,Nongrad_P75,Grad_share,Grad_premium
count,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0
mean,3879.815029,127672.0,2250.872832,94037.034682,72861.184971,3506.427746,0.039343,76755.780347,52596.508671,112087.34104,214720.3,154553.5,120736.8,8486.323699,0.053947,58583.815029,40078.179191,84332.947977,0.400595,0.328505
std,1687.75314,219551.2,3805.923082,159723.860054,123153.615862,5909.87145,0.019076,16912.102488,10896.842595,30266.551533,399680.0,290066.3,233525.6,16135.491564,0.019329,15028.468079,9509.017523,20861.431281,0.165964,0.185805
min,1100.0,1542.0,22.0,1008.0,770.0,0.0,0.0,47000.0,24500.0,65000.0,2232.0,1328.0,980.0,0.0,0.0,37000.0,25000.0,48000.0,0.09632,-0.025
25%,2403.0,15284.0,314.0,12659.0,9894.0,453.0,0.026068,65000.0,45000.0,93000.0,20564.0,15914.0,11755.0,880.0,0.041981,48700.0,34000.0,72000.0,0.267567,0.230769
50%,3608.0,37872.0,688.0,28930.0,22523.0,1179.0,0.036654,75000.0,50000.0,108000.0,68993.0,50092.0,38384.0,3157.0,0.051031,55000.0,38000.0,80000.0,0.398745,0.320755
75%,5503.0,148255.0,2528.0,109944.0,80794.0,3329.0,0.048051,90000.0,60000.0,130000.0,184971.0,129179.0,103629.0,7409.0,0.064387,65000.0,44000.0,97000.0,0.499117,0.4
max,6403.0,1184158.0,21994.0,915341.0,703347.0,35718.0,0.138515,135000.0,85000.0,294000.0,2996892.0,2253649.0,1882507.0,136978.0,0.160907,126000.0,80000.0,215000.0,0.931175,1.647059


### Now, try listing the first 7 rows of the DataFrame using the .head() function and assign the result to midwest_sample.

In [406]:
midwest_sample=data.head(7)
midwest_sample

Unnamed: 0,Major_code,Major,Major_category,Grad_total,Grad_sample_size,Grad_employed,Grad_full_time_year_round,Grad_unemployed,Grad_unemployment_rate,Grad_median,...,Nongrad_total,Nongrad_employed,Nongrad_full_time_year_round,Nongrad_unemployed,Nongrad_unemployment_rate,Nongrad_median,Nongrad_P25,Nongrad_P75,Grad_share,Grad_premium
0,5601,CONSTRUCTION SERVICES,Industrial Arts & Consumer Services,9173,200,7098,6511,681,0.087543,75000.0,...,86062,73607,62435,3928,0.050661,65000.0,47000,98000.0,0.09632,0.153846
1,6004,COMMERCIAL ART AND GRAPHIC DESIGN,Arts,53864,882,40492,29553,2482,0.057756,60000.0,...,461977,347166,250596,25484,0.068386,48000.0,34000,71000.0,0.10442,0.25
2,6211,HOSPITALITY MANAGEMENT,Business,24417,437,18368,14784,1465,0.073867,65000.0,...,179335,145597,113579,7409,0.048423,50000.0,35000,75000.0,0.119837,0.3
3,2201,COSMETOLOGY SERVICES AND CULINARY ARTS,Industrial Arts & Consumer Services,5411,72,3590,2701,316,0.080901,47000.0,...,37575,29738,23249,1661,0.0529,41600.0,29000,60000.0,0.125878,0.129808
4,2001,COMMUNICATION TECHNOLOGIES,Computers & Mathematics,9109,171,7512,5622,466,0.058411,57000.0,...,53819,43163,34231,3389,0.0728,52000.0,36000,78000.0,0.144753,0.096154
5,3201,COURT REPORTING,Law & Public Policy,1542,22,1008,860,0,0.0,75000.0,...,8921,6967,6063,518,0.069205,50000.0,34000,75000.0,0.147376,0.5
6,6206,MARKETING AND MARKETING RESEARCH,Business,190996,3738,151570,123045,8324,0.052059,80000.0,...,1029181,817906,662346,45519,0.052719,60000.0,40000,91500.0,0.156531,0.333333


### Load rows 2 to 6 of column Major_category and assign it to variable major_category.
Print out the variable major_category.

In [412]:
major_category=data['Major_category'][1:6] #positions 1 to 5, i.e. the 2nd to the 6th row
major_category

1                                   Arts
2                               Business
3    Industrial Arts & Consumer Services
4                Computers & Mathematics
5                    Law & Public Policy
Name: Major_category, dtype: object

In [416]:
data.iloc[[2,3,4,5,6],[2]]

Unnamed: 0,Major_category
2,Business
3,Industrial Arts & Consumer Services
4,Computers & Mathematics
5,Law & Public Policy
6,Business


In [418]:
data.loc[[2,3,4,5,6],['Major_category']]

Unnamed: 0,Major_category
2,Business
3,Industrial Arts & Consumer Services
4,Computers & Mathematics
5,Law & Public Policy
6,Business


### List the Major Category of all the data where the employment percentage is greater than 80%.

Assign it to the variable major_emp_data and print it out.

In [423]:
#creating a new column with the employment percentage
data['emp_percent'] = (data.Grad_employed/data.Grad_total) * 100.0
# data['emp_percent']


In [425]:
major_emp_data=data['Major_category'][data['emp_percent']>80.0]
major_emp_data

4                  Computers & Mathematics
8                  Computers & Mathematics
9              Communications & Journalism
13             Communications & Journalism
15                 Computers & Mathematics
17                             Engineering
20                     Law & Public Policy
23                                Business
24                 Computers & Mathematics
25                                Business
27                                  Health
28                 Computers & Mathematics
30                                Business
34                             Engineering
38                 Computers & Mathematics
40                       Interdisciplinary
41                       Physical Sciences
45                Psychology & Social Work
46                                    Arts
50     Industrial Arts & Consumer Services
51                                Business
60                 Computers & Mathematics
63                  Biology & Life Science
66         

### Can you determine the maximum number of graduates employed in each major category?

Assign the result to the variable, grad_cat_max and print it out.

In [426]:
grad_cat_max = data.groupby('Major_category').Grad_employed.max()
grad_cat_max


Major_category
Agriculture & Natural Resources         47755
Arts                                   150394
Biology & Life Science                 898342
Business                               622357
Communications & Journalism            212156
Computers & Mathematics                287467
Education                              698049
Engineering                            371723
Health                                 437115
Humanities & Liberal Arts              598806
Industrial Arts & Consumer Services    103790
Interdisciplinary                       12708
Law & Public Policy                    154146
Physical Sciences                      336838
Psychology & Social Work               915341
Social Science                         548199
Name: Grad_employed, dtype: int64

# Categorical and dummy variables 

![titel](https://refactored.ai/user/rf385/files/images/types_of_data.png)

### Transform the 'Sex' column into dummy variables and append them to the checks DataFrame.

In [427]:
raw_data = {'Day': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday','Sunday', 'Monday', 'Tuesday', 'Wednesday'],
                      'Sex': ['male', 'female', 'male', 'female', 'female', 'female', 'female', 'male', 'male', 'female'],
                      'Total_Amount': ['100.0', '30', '40', '70', '90', '20', '50', '30','60', '70']}
checks = pd.DataFrame(raw_data, columns = ['Total_Amount', 'Day', 'Sex'])

# Create dummy (indicator) columns from the Sex column and append them to the checks DataFrame.
sex_dummies = pd.get_dummies(checks['Sex'], dtype=int) # dtype=int gives 0/1 columns; newer pandas versions default to True/False
checks = pd.concat((checks, sex_dummies), axis=1)

In [428]:
checks

Unnamed: 0,Total_Amount,Day,Sex,female,male
0,100.0,Monday,male,0,1
1,30.0,Tuesday,female,1,0
2,40.0,Wednesday,male,0,1
3,70.0,Thursday,female,1,0
4,90.0,Friday,female,1,0
5,20.0,Saturday,female,1,0
6,50.0,Sunday,female,1,0
7,30.0,Monday,male,0,1
8,60.0,Tuesday,male,0,1
9,70.0,Wednesday,female,1,0
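The female and male columns above carry the same information, since each is the complement of the other. When the dummies feed a model, a common option is to drop one of them with `drop_first=True`. A minimal sketch on a shortened Sex column (the four values are made up for illustration):

```python
import pandas as pd

checks = pd.DataFrame({'Sex': ['male', 'female', 'male', 'female']})

# drop_first=True removes the first category (female), leaving a single 0/1 column.
dummies = pd.get_dummies(checks['Sex'], drop_first=True, dtype=int)
```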
