# RDS@GSU - PYTHON & DATA 1: INTRO TO PYTHON FOR DATA ANALYSIS

#### Copyright + References

In [3]:
# The content in this notebook was developed by Jeremy Walker.
# All sample code and notes are provided under a Creative Commons
# ShareAlike license.

# Official Copyright Rules / Restrictions / Privileges
# Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
# https://creativecommons.org/licenses/by-sa/4.0/

# Datasets used in this workshop are originally sourced from 
# the "CAR" package in R:
# Fox, J., & Weisberg, S. (2019). An R companion to applied
# regression (Third edition). SAGE.

# Part 0 - Using Python and the Jupyter Notebook Interface

In [4]:
# The "#" symbol marks annotations, or comments, in Python.  I make heavy use of comments throughout workshops
# and code.  Here they are meant to help you understand the code presented.  You may also want to make heavy
# use of your own comments to keep your code readable, interpretable, and sustainable.


### SHIFT + ENTER
runs the selected cell and advances the cursor to the next cell

### CTRL + ENTER
runs the selected cell, but does NOT advance the cursor to the next cell

### ALT + ENTER
runs the selected cell, inserts a new cell directly below it, and advances the cursor to the newly inserted cell

In [4]:
# Numbers are simply represented as digits.
3

3

In [5]:
# You can also perform simple math operations by default.
# Be wary of order of operations.  I advise using ( ) heavily!
3 + ((3 / 2)**2)

5.25
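To see why the parentheses matter, here is a small sketch that reuses the numbers from the cell above with different groupings. The variable names are illustrative only and are not used elsewhere in the workshop.

```python
# Same numbers and operators, different grouping, different results.
no_parens = 3 + 3 / 2 ** 2          # ** runs first, then /, then +
grouped_sum = (3 + 3) / 2 ** 2      # the sum is evaluated first
grouped_ratio = 3 + ((3 / 2) ** 2)  # the expression used in the cell above

print(no_parens, grouped_sum, grouped_ratio)
```

Without parentheses, Python applies exponents before division and division before addition, so explicit grouping is the safest way to get the calculation you intend.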

In [6]:
# Strings (literal text) are represented using single or double quotes: 'text'  "text"
# The output will also reflect this distinction by including quotes around the
# default output.
"3"

'3'

In [7]:
# You can do some strange operations using strings, but be careful!
3 + 3

6

In [8]:
"3" + "3"

'33'

In [9]:
"3" + "a"

'3a'

In [10]:
# If you attempt to do STUFF (operations, functions, processes) with mismatched types 
# of data, you are likely to get errors.  Don't panic, this will happen a lot as you start
# using Python for the first time. Just treat it as an opportunity to learn and inspect
# your code.

"a" + 3

TypeError: can only concatenate str (not "int") to str
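One common way to resolve this kind of error is to convert one value so that both sides have the same type. The sketch below uses the built-in functions str(...) and int(...); the variable names are made up for illustration.

```python
number = 3
letter = "a"

# Convert the number to text before joining it to other text.
as_text = letter + str(number)

# Convert numeric text to an integer before doing arithmetic.
as_number = int("3") + number

# You can also catch the error and inspect it instead of stopping the cell.
try:
    letter + number
except TypeError as err:
    message = str(err)

print(as_text, as_number)
print(message)
```

Which conversion is right depends on what you want: joined text or a numeric sum.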

In [12]:
# "Objects", sometimes referred to as "variables" in Python, are the core building blocks for 
# how you will structure and organize your code.  Objects can represent singular values, 
# complex tables of data, statistical models, and many other things.  What is important to 
# understand is that an object is given its own unique name, and that name can then
# be used to do things with the object later on.

# To create or update an object, simply use the "=" with the object on the left and the value/data/information on the right
# EXAMPLE SYNTAX: objectName = value
# EXAMPLE SYNTAX: x = 3

x = 3

In [13]:
# Perform additional operations
x + 99

102

In [14]:
# For now, x remains unchanged
x

3

In [15]:
# You can also update and overwrite objects
x = x + 17

In [17]:
x
# x is now 20 (3 + 17).  If you run the previous cell a second time, 17 is added again and x becomes 37.

20

In [18]:
# Object names can be entirely arbitrary, nonsensical, and unrelated to their underlying contents.

harrypotter = "samwise gamgee"

In [19]:
harrypotter

'samwise gamgee'

In [21]:
# The first function worth really knowing about is type(...).
# It looks at the value or object inside type(...) and tells you "what" it is. 
# The answer might be a simple "str" for string or "int" for integer.  However, you may find references
# to much more complex data and object types.  This can be helpful when debugging your code and trying 
# to resolve errors.

type(harrypotter)

str
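As a quick sketch of how type(...) responds to different kinds of values, the loop below checks a handful of hypothetical examples. Values that look alike on screen can be very different things to Python.

```python
# A few values that look similar but are different types.
values = [3, 3.0, "3", True, [3, 3], None]

for value in values:
    print(repr(value), "->", type(value).__name__)

# Collect just the type names for later inspection.
type_names = [type(value).__name__ for value in values]
```

Checking types this way is often the fastest first step when an operation fails unexpectedly.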

# Part 1 - Introduction and opening data files

In [None]:
# In Python, there is an immense number of "modules" that allow users
# to perform a variety of pre-defined tasks and operations.  Learning
# to identify modules that are useful, accessible, and stable is a core
# part of becoming a Python user. For example, in raw Python code, 
# opening a CSV file may take 3-4 lines of code. Meanwhile, using a 
# module named Pandas, the same operation only takes a single line of code. 
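To make the comparison concrete, here is a rough sketch that reads the same small table twice: once with Python's built-in csv module and once with Pandas. The two-row dataset is hypothetical, and io.StringIO is used only so the example runs without a file on disk.

```python
import csv
import io

import pandas as pd

# Hypothetical two-row dataset; io.StringIO lets text behave like a file.
csv_text = "rank,salary\nProf,139750\nAsstProf,79750\n"

# Standard library: open, parse, and collect the rows yourself.
with io.StringIO(csv_text) as handle:
    reader = csv.DictReader(handle)
    rows = [row for row in reader]

# Pandas: one call returns a table with column names and inferred types.
table = pd.read_csv(io.StringIO(csv_text))

print(rows[0]["salary"])      # the csv module keeps every value as text
print(table["salary"].sum())  # Pandas has already converted to numbers
```

With a real file, you would pass the filename to pd.read_csv instead of the StringIO object.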

In [None]:
# We first need to import and activate a tool called "Pandas".  Pandas is
# one of the best and most user-friendly data-analysis modules in Python.
# If you are familiar with Excel, SQL, or R, a lot of Pandas utilities
# will feel familiar to you. To import a package, you will follow a 
# standard pattern. 

# NOTE: If you ever get an error saying that "there is no module named....",
# then you will need to open the "Anaconda Prompt" program and run 
# the command "conda install modulename"

In [22]:
# EXAMPLE SYNTAX: import packagename as abbrev

import pandas as pd

In [24]:
# Once you've loaded a module, you can then use its pre-defined functions.
# For example, to tell Pandas to open a CSV file, we simply type 
# pd.read_csv("filename.csv")

# EXAMPLE SYNTAX: abbrev.function(...)

pd.read_csv("salaries.csv")

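# Typing the function name without parentheses shows the function itself
# (including its available options) instead of running it.
pd.read_csv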

<function pandas.io.parsers.read_csv(filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]], sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal: str = '.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)>

In [28]:
# Pandas has the power to read many different filetypes. Just a note of
# caution, each filetype may have unique requirements or options.  I have
# noted this as "..." in the examples below.  You will need to read the
# documentation to make sure you are importing the data correctly.

# Excel
# pd.read_excel("filename.xlsx",...)

# SAS
# pd.read_sas("filename.sas7bdat",...)
# pd.read_sas("filename.xport",...)

# SPSS
# pd.read_spss("filename.sav",...)

# STATA
# pd.read_stata("filename.dta",...)

In [29]:
# Using the code from above, we can use Pandas (pd) to open a CSV file 
# (read_csv) and then assign all of that information to a new object I will
# call "df". NOTE: "df" is just a common placeholder name you will see when
# looking up support and documentation online. You could just as easily
# use other object names (e.g. "data_wave1" or "raw_data").

df = pd.read_csv("salaries.csv")

In [28]:
df

Unnamed: 0,rank,discipline,yrs_since_phd,yrs_service,sex,salary
0,Prof,B,19,18,Male,139750
1,Prof,B,20,16,Male,173200
2,AsstProf,B,4,3,Male,79750
3,Prof,B,45,39,Male,115000
4,Prof,B,40,41,Male,141500
...,...,...,...,...,...,...
392,Prof,A,33,30,Male,103106
393,Prof,A,31,19,Male,150564
394,Prof,A,42,25,Male,101738
395,Prof,A,25,15,Male,95329


In [30]:
type(df)

pandas.core.frame.DataFrame

# Part 2 - Descriptive Statistics

In [31]:
# Many objects in Python will have "methods" that allow you to do something
# additional to or with the values stored in the object.

# EXAMPLE SYNTAX objectName.methodName(...)

# In the example below, we can use a method called "head" for a DataFrame

df.head()

df.head(7)

Unnamed: 0,rank,discipline,yrs_since_phd,yrs_service,sex,salary
0,Prof,B,19,18,Male,139750
1,Prof,B,20,16,Male,173200
2,AsstProf,B,4,3,Male,79750
3,Prof,B,45,39,Male,115000
4,Prof,B,40,41,Male,141500
5,AssocProf,B,6,6,Male,97000
6,Prof,B,30,23,Male,175000


In [14]:
# You can specify a different number in head(#)
df.head(3)

Unnamed: 0,rank,discipline,yrs_since_phd,yrs_service,sex,salary
0,Prof,B,19,18,Male,139750
1,Prof,B,20,16,Male,173200
2,AsstProf,B,4,3,Male,79750


In [39]:
# The same pattern applies for the method "tail"
df.tail()

Unnamed: 0,rank,discipline,yrs_since_phd,yrs_service,sex,salary
392,Prof,A,33,30,Male,103106
393,Prof,A,31,19,Male,150564
394,Prof,A,42,25,Male,101738
395,Prof,A,25,15,Male,95329
396,AsstProf,A,8,4,Male,81035


In [44]:
df.tail(7)

Unnamed: 0,rank,discipline,yrs_since_phd,yrs_service,sex,salary
390,Prof,A,40,19,Male,166605
391,Prof,A,30,19,Male,151292
392,Prof,A,33,30,Male,103106
393,Prof,A,31,19,Male,150564
394,Prof,A,42,25,Male,101738
395,Prof,A,25,15,Male,95329
396,AsstProf,A,8,4,Male,81035


In [41]:
# "dtypes" will give us some insight into what types of data are encoded
# in the dataframe. Sometimes this will give you clues as to where data 
# is missing, corrupt, or needing conversion.

df.dtypes

rank             object
discipline       object
yrs_since_phd     int64
yrs_service       int64
sex              object
salary            int64
dtype: object
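As an illustration of how dtypes can flag data that needs conversion, the sketch below builds a hypothetical column in which one entry was recorded as text. The column names and values are invented for this example and are not part of the salaries data.

```python
import pandas as pd

# Hypothetical column where one entry is text instead of a number.
raw = pd.DataFrame({"salary": ["139750", "173200", "unknown"]})
print(raw.dtypes)  # salary is not a numeric type because of the text entry

# errors="coerce" turns anything that cannot be parsed into NaN (missing).
raw["salary_clean"] = pd.to_numeric(raw["salary"], errors="coerce")
print(raw.dtypes)
print(raw["salary_clean"].isna().sum())  # how many entries failed to convert
```

If a column you expect to be numeric shows up as text, a stray non-numeric entry like this one is a likely cause.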

In [45]:
# The "sample" method draws random rows from the dataframe (one row by default).

df.sample()

Unnamed: 0,rank,discipline,yrs_since_phd,yrs_service,sex,salary
364,Prof,A,43,43,Male,205500


In [46]:
df.sample(2)

Unnamed: 0,rank,discipline,yrs_since_phd,yrs_service,sex,salary
287,AsstProf,A,2,0,Male,85000
297,Prof,A,17,11,Male,148800


In [47]:
# What if you want to weight samples differently?
# What if you want a proportion of the data, not a rigid integer?
# What if you want reproducible results?

# If you ever want (or need) to find more details about how a function 
# or method works, you can type "?..." 
# (e.g. ?pd.DataFrame  ?df.head   ?pd.read_excel)

?df.sample

Signature:
df.sample(
    n=None,
    frac=None,
    replace=False,
    weights=None,
    random_state=None,
    axis=None,
) -> ~FrameOrSeries
Docstring:
Return a random sample of items from an axis of object.

You can use `random_state` for reproducibility.

Parameters
----------
n : int, optional
    Number of items from axis to return. Cannot be used with `frac`.
    Default = 1 if `frac` = None.
frac : float, optional
    Fraction of axis items to return. Cannot be used with `n`.
replace : bool, default False
    Allow or disallow sampling
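The three questions at the top of this cell map onto the frac, random_state, and weights parameters listed in the documentation. Here is a minimal sketch on a hypothetical six-row table (not the workshop dataset), assuming default settings everywhere else.

```python
import pandas as pd

# Hypothetical six-row table standing in for the salaries data.
demo = pd.DataFrame({
    "rank": ["Prof", "Prof", "AsstProf", "Prof", "AssocProf", "AsstProf"],
    "salary": [139750, 173200, 79750, 115000, 97000, 81035],
})

# A proportion of the rows instead of a fixed count.
half = demo.sample(frac=0.5, random_state=42)

# The same random_state gives the same rows every time.
again = demo.sample(frac=0.5, random_state=42)

# Rows with larger weights are more likely to be drawn.
weighted = demo.sample(n=2, weights=demo["salary"], random_state=42)

print(len(half), half.equals(again), len(weighted))
```

The value 42 is arbitrary; any fixed integer makes the draw repeatable.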

In [32]:
# The 'describe' method gives high-level descriptive statistics.
# By default, it only inspects numeric variables.

df.describe()

Unnamed: 0,yrs_since_phd,yrs_service,salary
count,397.0,397.0,397.0
mean,22.314861,17.61461,113706.458438
std,12.887003,13.006024,30289.038695
min,1.0,0.0,57800.0
25%,12.0,7.0,91000.0
50%,21.0,16.0,107300.0
75%,32.0,27.0,134185.0
max,56.0,60.0,231545.0


In [33]:
# You can always dive deeper into a function or method to see what 
# options are available to you.

# ?df.describe

In [34]:
# Include an option for 'describe' to give some descriptive information
# about categorical variables

df.describe(include='all')

Unnamed: 0,rank,discipline,yrs_since_phd,yrs_service,sex,salary
count,397,397,397.0,397.0,397,397.0
unique,3,2,,,2,
top,Prof,B,,,Male,
freq,266,216,,,358,
mean,,,22.314861,17.61461,,113706.458438
std,,,12.887003,13.006024,,30289.038695
min,,,1.0,0.0,,57800.0
25%,,,12.0,7.0,,91000.0
50%,,,21.0,16.0,,107300.0
75%,,,32.0,27.0,,134185.0


In [35]:
# Include categorical data AND specify the exact percentile cutoffs for
# numeric data

df.describe(include='all', percentiles=[0.025,0.25,0.5,0.75,0.975] )

Unnamed: 0,rank,discipline,yrs_since_phd,yrs_service,sex,salary
count,397,397,397.0,397.0,397,397.0
unique,3,2,,,2,
top,Prof,B,,,Male,
freq,266,216,,,358,
mean,,,22.314861,17.61461,,113706.458438
std,,,12.887003,13.006024,,30289.038695
min,,,1.0,0.0,,57800.0
2.5%,,,3.0,0.0,,70761.2
25%,,,12.0,7.0,,91000.0
50%,,,21.0,16.0,,107300.0


In [38]:
# PRACTICE - Change the percentile cutoffs to any values you want in 
# the range [0.0,1.0]

df.describe(include='all', percentiles=[0.075,0.22,0.55,0.78,0.979] )

Unnamed: 0,rank,discipline,yrs_since_phd,yrs_service,sex,salary
count,397,397,397.0,397.0,397,397.0
unique,3,2,,,2,
top,Prof,B,,,Male,
freq,266,216,,,358,
mean,,,22.314861,17.61461,,113706.458438
std,,,12.887003,13.006024,,30289.038695
min,,,1.0,0.0,,57800.0
7.5%,,,4.0,2.0,,74788.6
22%,,,11.0,6.0,,88798.6
50%,,,21.0,16.0,,107300.0


# Part 3 - Brief Primer on Modules, Functions, and Parameters/Options
Understanding some of the basics of how to import modules and use functions is critical to making Python work for you!

### **Importing modules and functions**
#### Let's say a module exists somewhere on the internet that we want to use for our Python project:
> ##### "exampleModule"

#### That module likely contains many functions and other tools:
> ##### "function1" ,"function2" ,"function3" ,"function4"

#### There are two main ways we can bring these tools into our Python script and use them...
> ##### *import exampleModule as exmo*
#### This imports everything from the exampleModule.  In order to use the functions and tools, we use the abbreviation:
> ##### *import exampleModule as exmo*
> ##### *exmo.function1()*
> ##### *exmo.function2()*
> ##### *exmo.function3()*

#### Alternatively, we can import individual functions from modules
> ##### *from exampleModule import function1, function3*
#### Now, the individual functions are explicitly imported and defined, so we can just use the function names directly:
> ##### *from exampleModule import function1, function3*
> ##### *function1()*
> ##### *function3()*

### **Importing modules and functions - in practice**
Note: the following code will not actually produce anything substantive on screen.

In [40]:
# # Example
# import moduleName as abbreviation
# abbreviation.functionName()

# Option 1 (used in this workshop)
import pandas as pd
pd.DataFrame()

# Option 2
from pandas import DataFrame
DataFrame()

### **Using parameters/options in functions and methods**
#### While you can create your own functions (and "methods") from scratch, this section focuses on using them in the context of modules that we have imported.
#### At its core, once we've imported a module and want to start using its functions, it's important to understand a few key elements.
##### "Calling" or "call" is the verb sometimes used to describe when you use a function. In the code below, we import the exampleModule, give it an abbreviation, and then "call" or use function1:
> ##### import exampleModule as exmo
> ##### exmo.function1()
##### By calling function1, we are in essence telling Python to "do whatever it is that function1 does!"  That's vague, but the truth is that functions can do a LOT of different things.  To understand what any individual function does, you will need to read the reference/documentation materials for the module you are using.
##### Functions can sometimes be used as standalone operations:
> ##### exmo.function1()
##### But often, functions may have optional or mandatory input variables that determine if and how the function works:
> ##### exmo.function1( parameter1 = value1, parameter2 = value2, etc...)
> ##### exmo.function1( favorite_movie = "Forrest Gump", favorite_number = 3.14 )
##### Every individual function from every individual module is different in terms of what is required, what is optional, and what the function does.  So, it is 100% worth your time to read through and understand the modules and particular functions you intend to use in your research.  You can find this information on websites, but it is also usually available directly in Python by putting a "?" before an individual function (without the parentheses):
> ##### import exampleModule as exmo
> ##### ?exmo.function1
> ##### 
> ##### from exampleModule import function1
> ##### ?function1
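Because exampleModule is fictional, here is a runnable sketch of the same idea using Python's built-in sorted(...) function. It has one mandatory input (the items to sort) and optional parameters named key and reverse. The word list is made up for illustration.

```python
words = ["tea", "bananas", "cat", "oranges"]

# Only the mandatory input: default alphabetical, ascending order.
default_order = sorted(words)

# Optional keyword parameters change how the same function behaves.
reversed_order = sorted(words, reverse=True)
by_length = sorted(words, key=len)

print(default_order)
print(reversed_order)
print(by_length)
```

Leaving out the mandatory input, as in sorted(), raises an error, while leaving out the optional parameters simply falls back to their defaults.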

### **Using parameters/options in functions and methods - in practice**

In [42]:
# # Example
# import moduleName as abbreviation
# abbreviation.functionName()

# Option 1 (used in this workshop)
import pandas as pd
?pd.DataFrame

# Option 2
from pandas import DataFrame
?DataFrame

Init signature:
DataFrame(
    data=None,
    index: Union[Collection, NoneType] = None,
    columns: Union[Collection, NoneType] = None,
    dtype: Union[ForwardRef('ExtensionDtype'), str, numpy.dtype, Type[Union[str, float, int, complex, bool]], NoneType] = None,
    copy: boo

### **Functions vs. Methods**
#### Methods are very similar to functions in that a Method(...) may or may not need inputs in order to "do something".  However, the difference is that a "function" is called on its own:
> ##### import exampleModule as exmo
> ##### exmo.function1()
#### Whereas, a "method" is called as an extension of an existing object:
> ##### object = value1
> ##### object.method()
#### Like functions, the exact options and details of how a method works will change from method to method, object to object.  Different types of objects may have many different types of methods.  Just like with functions, it is important to read the documentation:
> ##### object = data_value1
> ##### object.method_1()
> ##### object.method_1(option1 = value1, option2 = value2, ...)
> ##### ?object.method

### **Using methods - in practice**

In [43]:
# # Example
# object = data_value1
# object.method()

x = "oranges"

In [44]:
x.capitalize()

'Oranges'

In [45]:
x.swapcase()

'ORANGES'
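The methods above take no inputs. As a further sketch, the example below contrasts a built-in function with string methods that do and do not need inputs, using the same "oranges" value.

```python
x = "oranges"

# A function is called on its own and receives the object as input.
length = len(x)

# A method is attached to the object; some methods take no inputs...
shout = x.upper()

# ...and others need inputs to know what to do.
swapped = x.replace("oranges", "apples")
count_of_a = x.count("a")

print(length, shout, swapped, count_of_a)
print(x)  # the original string is unchanged
```

Note that these string methods return new values rather than changing x itself, so you would need to assign the result to an object to keep it.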

### **Accessing contents of an object**
#### The last major thing to note about Python notation for this workshop is the use of brackets [...] to access the contents of individual objects.  This is extremely important as researchers because we often work with lists, arrays, and tables of data.  In Python, this data is usually stored in an object (like a Pandas DataFrame) and using [...] allows us to access the information in particular ways.
#### The exact way you use brackets [...] will depend on the exact nature of the individual object you are using.  But the general pattern is as follows:
> ##### *single item*
>> ##### object[identifier]
>> ##### object[15]
> ##### *range of items*
>> ##### object[identifier:identifier]
>> ##### object[15:42]
> ##### *item by name*
>> ##### object["item_name"]
>> ##### object["oranges"]

In [46]:
# Hypothetically, let's say the object X is a list and contains a list of words
X = ["cat", "dog", "bananas", "oranges", "coffee", "tea"]
X

['cat', 'dog', 'bananas', 'oranges', 'coffee', 'tea']

In [None]:
# If we only want to access or view a single item from the list X
# then we can use square brackets [].

In [47]:
# Show the first item (note: counting starts at zero in Python)
X[0]

'cat'

In [48]:
# Show the third item
X[2]

'bananas'

In [49]:
# Show a range of items
X[2:5]

['bananas', 'oranges', 'coffee']
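A few more bracket patterns are sketched below, including the "item by name" case from the outline earlier in this part. The dictionary and its prices are hypothetical and appear only in this example.

```python
X = ["cat", "dog", "bananas", "oranges", "coffee", "tea"]

last_item = X[-1]  # negative numbers count from the end
first_two = X[:2]  # a missing start means "from the beginning"
middle = X[2:5]    # stops BEFORE position 5, so 3 items come back

# "Item by name" applies to objects with labels, such as a dictionary.
prices = {"oranges": 0.75, "coffee": 3.50}  # hypothetical values
orange_price = prices["oranges"]

print(last_item, first_two, len(middle), orange_price)
```

The detail to remember for lists is that the end of a range is excluded, which is why X[2:5] returns positions 2, 3, and 4.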

# Part 4 - Variable Selection and Descriptive Statistics (continued)

In [None]:
# So far we've applied functions and methods to the entire dataset at once.  
# However, sometimes that is impractical or undesired.  You can view subsets
# of the data in a variety of ways that will produce the same results 

# EXAMPLE SYNTAX for single column

# dataframe['columnName1']     

# The outer-brackets are used to signify when you want to slice, index, 
# or capture a subset of the dataframe.

# dataframe.loc[:,'columnName1'] 

# The ".loc" approach is more explicit than the previous one because you
# always state both the rows and the columns you want.

# When using the two methods above, the output/returned info is a single
# column from the DataFrame - also known in Pandas as
# a "Series" or, in our current syntax, "pd.Series(...)"

In [50]:
# First approach

df['yrs_service']

0      18
1      16
2       3
3      39
4      41
       ..
392    30
393    19
394    25
395    15
396     4
Name: yrs_service, Length: 397, dtype: int64

In [51]:
# Second approach

df.loc[:,'yrs_service']

0      18
1      16
2       3
3      39
4      41
       ..
392    30
393    19
394    25
395    15
396     4
Name: yrs_service, Length: 397, dtype: int64

In [55]:
# PRACTICE - Show only data from the column "rank" using the first approach

df["rank"]

0          Prof
1          Prof
2      AsstProf
3          Prof
4          Prof
         ...   
392        Prof
393        Prof
394        Prof
395        Prof
396    AsstProf
Name: rank, Length: 397, dtype: object

In [56]:
# PRACTICE - ...now use the second approach

df.loc[:,"rank"]

0          Prof
1          Prof
2      AsstProf
3          Prof
4          Prof
         ...   
392        Prof
393        Prof
394        Prof
395        Prof
396    AsstProf
Name: rank, Length: 397, dtype: object

In [57]:
# Pandas Series, just like DataFrames, can take on a variety of methods 
# useful for descriptive statistics

df['yrs_service'].describe()

count    397.000000
mean      17.614610
std       13.006024
min        0.000000
25%        7.000000
50%       16.000000
75%       27.000000
max       60.000000
Name: yrs_service, dtype: float64

In [58]:
# Calculate the average

df['yrs_service'].mean()

17.614609571788414

In [63]:
# PRACTICE
# Identify the median of "yrs_service" in the data

df['yrs_service'].median()

16.0

In [59]:
# Calculate the standard deviation

df['yrs_service'].std()

13.006023785473102

In [64]:
# Minimum value

df['yrs_service'].min()

0

In [66]:
# PRACTICE
# Maximum value

df['yrs_service'].max()

60

In [67]:
# Identify the value at the specified quantile/percentile point 
# (measured from 0.0 to 1.0)

df['yrs_service'].quantile( 0.25 )

7.0

In [69]:
# PRACTICE
# Identify the value at the 0.75 quantile point.

df['yrs_service'].quantile( 0.75 )

27.0

In [70]:
# Use brackets and commas to identify a list of quantile levels. 
# e.g.  lists  -->  [... , ... , ...]
# e.g.  df['yrs_service'].quantile([... , ... , ...])

df['yrs_service'].quantile( [ 0.25, 0.75 ] )

0.25     7.0
0.75    27.0
Name: yrs_service, dtype: float64
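Quantiles do not always land exactly on a value in the data. By default, Pandas interpolates linearly between the two nearest values. The sketch below works through one case by hand on a hypothetical four-value Series and compares it with the Pandas result.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4])  # hypothetical data, already sorted

# By hand: position = (n - 1) * q = 3 * 0.25 = 0.75, which is three
# quarters of the way from the first value (1) to the second value (2).
position = (len(values) - 1) * 0.25
by_hand = 1 + position * (2 - 1)

from_pandas = values.quantile(0.25)
print(by_hand, from_pandas)
```

Other interpolation rules exist, so a quantile reported by different software may differ slightly from the Pandas default.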

In [72]:
# Pandas cannot readily calculate all statistical measures that may be of value to you.
# Many other modules exist to help with these needs.  SciPy is one of the single most robust
# and well maintained scientific and statistical modules in the Python ecosystem.

import scipy.stats as stats

In [73]:
# Now that we're using a different module to call statistical functions, we need to change our syntax.

# EXAMPLE SYNTAX: abbrev.function(...)
# EXAMPLE SYNTAX: stats.function(dataframe['variableName'])

In [74]:
# Calculate the "skew" of a distribution

stats.skew( df['yrs_service'] )

0.6481088240680348

In [75]:
# PRACTICE - Calculate the skew for the column "salary"

stats.skew( df['salary'] )

0.7118657337591157

In [76]:
# Calculate the "kurtosis" of a distribution

stats.kurtosis( df['yrs_service'] )

-0.32259018904703085

In [77]:
# Many statistical measures have different formulations, assumptions, and requirements.
# Always check the documentation to make sure you are getting the outputs you expect.

?stats.kurtosis

Signature: stats.kurtosis(a, axis=0, fisher=True, bias=True, nan_policy='propagate')
Docstring:
Compute the kurtosis (Fisher or Pearson) of a dataset.

Kurtosis is the fourth central moment divided by the square of the
variance. If Fisher's definition is used, then 3.0 is subtracted from
the result to give 0.0 for a normal distribution.

If bias is False then the kurtosis is calculated using k statistics to
eliminate bias coming from biased moment estimators

Use `kurtosistest` to see if result is close enough to normal.

Parameters
----------
a : array
    Data for which the kurtosis is calculated.
axis : int or None, optional
    Axis along which the kurtosis is calculated. Default is 0.
    If None, compute over the wh
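The documentation describes two definitions of kurtosis, selected with the fisher parameter. As a small sketch of why checking matters, the example below computes both on a hypothetical list of values; per the docstring, the results should differ by exactly 3.0.

```python
import scipy.stats as stats

data = [0, 1, 2, 3, 4, 10, 18, 39, 41, 60]  # hypothetical values

fisher = stats.kurtosis(data)                 # default: fisher=True
pearson = stats.kurtosis(data, fisher=False)  # no 3.0 subtracted

print(fisher, pearson, pearson - fisher)
```

If you compare your results with another statistics package, confirm which definition that package reports before concluding the numbers disagree.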

In [None]:
# Different methods apply differently to a Series depending on what type of data is contained
# therein. For categorical data in particular, there are a few nuances and additional functions 
# you will find useful.

In [78]:
# The Pandas method 'describe' will work when individual categorical variables are specified.
# The results will only provide a count of all samples, the number of unique levels/factors/categories
# present in the data, the most frequent level, and the frequency of the most frequent level.

df['discipline'].describe()

count     397
unique      2
top         B
freq      216
Name: discipline, dtype: object

In [79]:
# 'value_counts' gives a level-by-level frequency count of each unique value

df['discipline'].value_counts()

B    216
A    181
Name: discipline, dtype: int64

In [80]:
# using '...value_counts(normalize=True)' will return the proportion of the dataset represented
# by each categorical level

df['discipline'].value_counts(normalize=True)

B    0.544081
A    0.455919
Name: discipline, dtype: float64
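The normalized values are simply each count divided by the total number of rows. Here is the arithmetic as a short sketch, using the counts reported by value_counts() above.

```python
counts = {"B": 216, "A": 181}  # counts reported by value_counts() above
total = sum(counts.values())   # 397 rows in total

# Each proportion is that level's count divided by the total.
proportions = {level: n / total for level, n in counts.items()}

print(total)
print(proportions)
```

Multiplying a proportion by 100 gives the percentage of rows in that category.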

# Part 5 - Indexing & Selecting Samples

In [81]:
# Lastly, we will start to perform what is referred to as
# "indexing" our dataframe. In short, this means selecting specific rows 
# and columns from the dataframe.

# Generally speaking, the way you index a DataFrame object follows this progression:
# df                      call the whole dataframe
# df.loc                  using df, add the location (loc) method
# df.loc[:,:]             using df and the loc method, add [:,:] to select all rows and all columns
# df.loc[rows,columns]    using df with loc specify which rows and which columns


# General syntax example:                       dataframeObject.loc[ rows , columns  ]

# Rows 0 to 15, column "salary"                  df.loc[ 0:15 , "salary" ]

# Rows 44 to 46, columns "salary" and "sex"      df.loc[ 44:46 , ["salary","sex"] ]

# For non-consecutive rows...
# Rows 3, 7, 11, 13, 17 and
# columns "salary" and "rank"               df.loc[ [3,7,11,13,17]  , ["salary","rank"]  ]

# Example! Rows 3-17, column "salary"
df.loc[ 3:17 , "salary" ]

3     115000
4     141500
5      97000
6     175000
7     147765
8     119250
9     129000
10    119800
11     79800
12     77700
13     78000
14    104800
15    117150
16    101000
17    103450
Name: salary, dtype: int64
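Notice that the output above includes row 17. Unlike list slicing, where the end position is excluded, a .loc range includes both the first and the last label. The sketch below shows the difference on a hypothetical ten-row table.

```python
import pandas as pd

# Hypothetical table with default row labels 0 through 9.
demo = pd.DataFrame({"salary": range(100, 110)})

from_loc = demo.loc[3:5, "salary"]      # labels 3, 4, AND 5
from_list = list(range(100, 110))[3:5]  # positions 3 and 4 only

print(len(from_loc), len(from_list))
print(list(from_loc))
print(from_list)
```

This is worth double-checking whenever you count how many rows a .loc selection should return.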

In [82]:
# Rows 3-17, columns "salary" and "rank"
df.loc[ 3:17 , ["salary","rank"] ]

Unnamed: 0,salary,rank
3,115000,Prof
4,141500,Prof
5,97000,AssocProf
6,175000,Prof
7,147765,Prof
8,119250,Prof
9,129000,Prof
10,119800,AssocProf
11,79800,AsstProf
12,77700,AsstProf


In [84]:
# Rows 3, 7, 13, and 19; columns "salary" and "rank".  Row 7 is listed twice, so it is returned twice.
df.loc[ [3,7,7,13,19] , ["salary","rank"] ]

Unnamed: 0,salary,rank
3,115000,Prof
7,147765,Prof
7,147765,Prof
13,78000,AsstProf
19,137000,Prof


In [85]:
# PRACTICE - Using the df object, locate (loc) data for the 
# rows 10 to 15 and the column "sex".

df.loc[ 10:15 , "sex" ]

10    Male
11    Male
12    Male
13    Male
14    Male
15    Male
Name: sex, dtype: object

In [86]:
# PRACTICE - Using the df object, locate (loc) data for the 
# rows 15 to 25, and the column "salary".

df.loc[ 15:25 , "salary" ]

15    117150
16    101000
17    103450
18    124750
19    137000
20     89565
21    102580
22     93904
23    113068
24     74830
25    106294
Name: salary, dtype: int64

In [87]:
# PRACTICE - Using the df object, locate (loc) data for the 
# rows 45 to 55, and the columns "salary" and "yrs_since_phd".

df.loc[ 45:55 , ["salary","yrs_since_phd"] ]

Unnamed: 0,salary,yrs_since_phd
45,114778,25
46,98193,40
47,151768,23
48,140096,25
49,70768,1
50,126621,28
51,108875,12
52,74692,11
53,106639,16
54,103760,12


In [88]:
# PRACTICE - Using the df object, locate (loc) data for the 
# non-consecutive rows 101, 202, and 303 and the columns "salary" and "yrs_since_phd".

df.loc[ [101,202,303] , ["salary" , "yrs_since_phd"] ]

Unnamed: 0,salary,yrs_since_phd
101,126933,28
202,160400,28
303,105260,14


In [92]:
# PRACTICE - From scratch, using the df object, locate (loc) 
# data for the rows from 3145 to 3155 and the columns "yrs_since_phd" and "sex"



# Part 6 - SAVE YOUR WORK!

In [None]:
# Finally, after all of that hard work modifying, transforming, and manipulating your data
# do not forget to save your work!  This is easy with Pandas.  You simply use the df object
# and append the .to_csv(...) method to it.  You will need to provide a filename and any other options
# that are relevant to your data and needs.

In [93]:
# Create a new_df object, representing a subset of the original df dataset.
new_df = df.loc[ [3,5,7,9,11] , ["salary","rank","yrs_service","sex"] ]

In [94]:
new_df

Unnamed: 0,salary,rank,yrs_service,sex
3,115000,Prof,39,Male
5,97000,AssocProf,6,Male
7,147765,Prof,45,Male
9,129000,Prof,18,Female
11,79800,AsstProf,2,Male


In [95]:
# Export data using .to_csv(...)
new_df.to_csv("exported_data.csv", index=False)
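The index=False option leaves the row labels out of the exported file. As a sketch of what that means in practice, the example below exports a hypothetical three-row subset both ways and reads one version back. No file is written here: when no filename is given, to_csv returns the CSV text instead.

```python
import io

import pandas as pd

# Hypothetical subset whose row labels (3, 5, 7) are not 0, 1, 2.
subset = pd.DataFrame(
    {"salary": [115000, 97000, 147765], "rank": ["Prof", "AssocProf", "Prof"]},
    index=[3, 5, 7],
)

# With no filename, to_csv returns the CSV text instead of writing a file.
without_labels = subset.to_csv(index=False)
with_labels = subset.to_csv()

# Reading the text back shows what a later session would see.
reloaded = pd.read_csv(io.StringIO(without_labels))

print(without_labels)
print(with_labels)
print(list(reloaded.columns), list(reloaded.index))
```

If the original row labels carry meaning for your project, keep them by omitting index=False; otherwise the reloaded data is renumbered from zero.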