# PYTHON BASICS

This is a Jupyter Notebook, an interactive environment that is widely used for data science.

Pressing <b>Shift+Enter</b> runs the code in the current cell.

## BASIC OPERATIONS

<b>1. Print out your name:</b>

    print('your name') 
   

<b> 2. Try the following operations: </b>

    1 + 2
    10 - 2
    3 * 3
    4 / 2
    13 % 3
    3 > 2
    3 == 2
    10 + 2 == 3 * 4
    1 <= 1
    1 < 1 or 3 > 2
    1 < 1 and 3 > 2
    7 == 7 and not 4<4
    

<b> 3. Create and manipulate variables: </b>

    x = 10
    y = 2*5
    x == y
    
    x, y = 1, 5
    x == y 
    

## DATA TYPES

Before performing any operation, it is important to know the type of the object involved. Each data type supports its own functions and methods, and mixing incompatible types (for example, adding a string to an integer) raises an error.

<b>1. Numerical Types </b>

    Integer -> Whole numbers
    Float -> Numbers with a decimal point
    
    The distinction matters for some operations. In Python 3, dividing two integers with / always returns a float.
    
    >>> 15/4
    3.75
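
A short sketch of how integers and floats behave under division, assuming Python 3. The variable names are illustrative only.

```python
# Assumes Python 3: "/" is true division, "//" is floor division.
true_div = 15 / 4       # float result
floor_div = 15 // 4     # integer result, fractional part discarded
remainder = 15 % 4      # remainder of the floor division

print(true_div, floor_div, remainder)
print(type(true_div), type(floor_div))
print(4 / 2)            # still a float, even though the result is whole
```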

<b>2. String</b>

    Non-numerical data type, usually text. 
    ALWAYS enclosed with quotation marks, either "" or ''.
    Strings support some operations, such as concatenation with +.
    
    >>> 'hello' + ' ' + 'world'
    'hello world'

    Can also access a subsection of the string.
    
    >>> my_string = 'Hello world!'
    >>> my_string[0] #Gets first character, note that indexing starts at 0.
    'H'
    
    >>> my_string[1]
    'e'
   
    >>> my_string[-1] #Get the last character of the string
    '!'
    
    >>> my_string[2:4] #Get a substring. Note that second number is non-inclusive.
    'll'
    
    >>> my_string[:3] #Gets a substring from the first character to the third
    'Hel'
    
    >>> my_string[4:] #Gets a substring from the fifth character (index 4) to the end
    'o world!'
    
    >>> my_string[::2] #Gets every other character
    'Hlowrd'
    

<b>3. List</b>

    A way to group objects together. The items do not have to be of the same type.
    Created by using square brackets [].
    
    >>> my_list = ['Ram', 25, 'De Guzman', 1992]
    
    Lists are sequences, so they can be indexed and sliced just like strings.
    
    >>> my_list[1]
    25
    
    >>> my_list[:2]
    ['Ram', 25]

    Some useful methods to use with lists.
    
    >>> my_list.append('Philippines') #Adds an item to a list
    >>> my_list.remove('Ram') #Removes the first occurrence of the value
    >>> my_list.insert(2, 'MSBA') #Inserts a new item at index 2 (index first, then item)
    >>> len(my_list) #Returns how many items are in a list

    Use the in operator to check whether an item is in a list.
    
    >>> 'Ram' in my_list
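
The list methods above can be tried together in one cell. This is a minimal sketch that reuses the example list. Note that `insert` takes the index first and the item second.

```python
# Values taken from the example list in this section.
my_list = ['Ram', 25, 'De Guzman', 1992]

my_list.append('Philippines')   # add to the end
my_list.remove('Ram')           # remove the first matching value
my_list.insert(2, 'MSBA')       # insert(index, item): index comes first

print(my_list)
print(len(my_list))
print('Ram' in my_list)         # 'Ram' was removed above
```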




<b>4. Tuple </b>

    An ordered collection of items created by using parentheses ().
    Similar to lists, but immutable: they cannot be changed after they are created.
    
    Useful for ordered collections that should not change.
    
    >>> my_tuple = (1, 2)
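
To see what "unalterable" means in practice, the sketch below tries to modify a tuple and catches the resulting error. The names `first`, `second`, and `longer_tuple` are illustrative.

```python
my_tuple = (1, 2)

first, second = my_tuple        # tuples can be unpacked into variables

try:
    my_tuple[0] = 99            # item assignment is not allowed on tuples
except TypeError as err:
    error_message = str(err)
    print('Cannot modify a tuple:', error_message)

longer_tuple = my_tuple + (3,)  # builds a new tuple; the original is unchanged
print(my_tuple, longer_tuple)
```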
    

<b> 5. Sets </b>

    Unordered collection of DISTINCT objects. Created with braces around the items, e.g. {1, 2}, or with set().
    Note that empty braces {} create a dictionary, not a set.
    
    >>> my_set = {1, 1, 1, 2, 2, 3, 4, 5, 6, 6, 7, 7}
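
A small sketch of what happens to the duplicates, plus two common set operations. `other_set` is a made-up second set for illustration.

```python
my_set = {1, 1, 1, 2, 2, 3, 4, 5, 6, 6, 7, 7}
print(my_set)                   # duplicates are dropped
print(len(my_set))

other_set = {6, 7, 8}           # hypothetical second set
common = my_set & other_set     # intersection: items in both sets
combined = my_set | other_set   # union: items in either set

empty_set = set()               # {} would create an empty dictionary instead
print(common, combined, type(empty_set), type({}))
```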

<b> 6. Dictionaries </b>

    A collection of key-value pairs. Each item in a dictionary has a 'key' and an associated 'value'.
    Since Python 3.7, dictionaries keep the order in which items were inserted.
    Indexed using their keys.
    
    >>> my_dict = {"business": 4121, "math": 2061, "visual arts": 7321}
    >>> my_dict['business']

    To add or update an entry:
        
        >>> my_dict[<new key>] = <value>
        
    To get the keys and the values:
    
        >>> my_dict.keys()
        >>> my_dict.values()
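
A minimal sketch of updating and reading a dictionary, reusing the example above. The "physics" and "history" keys are made up for illustration.

```python
my_dict = {"business": 4121, "math": 2061, "visual arts": 7321}

my_dict["physics"] = 1500       # hypothetical new key: adds an entry
my_dict["math"] = 2100          # existing key: overwrites the value

print(list(my_dict.keys()))
print(list(my_dict.values()))

# .get() returns a default instead of raising KeyError for a missing key
print(my_dict.get("history", "not found"))

# .items() yields (key, value) pairs, which is handy in loops
for subject, count in my_dict.items():
    print(subject, count)
```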

### Converting Data Types

To determine an object's type, you can use the type() function.<br>
It is possible to 'type cast' an object, meaning to convert it to another data type.

    x = 1
    str(x)      # convert to string: '1'
    int(x)      # convert to integer: 1
    float(x)    # convert to float: 1.0
    
    y = [1, 1, 2, 3, 4, 4, 5]
    type(y)     # list
    set(y)      # set of the unique values in y
    list(y)     # convert to a list (a copy here, since y is already a list)
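
The conversions above can be checked in a cell as follows. This is a sketch with illustrative variable names. The last part shows one conversion that fails, assuming the text comes in as a decimal string.

```python
x = 1
x_as_text = str(x)              # '1'
x_as_float = float(x)           # 1.0
back_to_int = int(x_as_text)    # 1

print(type(x_as_text), type(x_as_float), type(back_to_int))

y = [1, 1, 2, 3, 4, 4, 5]
unique_values = set(y)                  # duplicates removed
unique_sorted = sorted(unique_values)   # sorted() always returns a list
print(unique_sorted)

try:
    int('3.7')                  # a decimal string cannot go straight to int
except ValueError:
    print(int(float('3.7')))    # convert to float first; int() then truncates
```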


## USER DEFINED FUNCTIONS

It is possible to create a function so that an operation can be repeated easily. <br>
Functions are defined with the <b>def</b> keyword.

    def getSquare(number):
        x = number ** 2
        return x
    
Note that variables defined inside a function have local scope, so they are not accessible outside the function.
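
A quick way to see local scope in action is sketched below. The function and variable names are illustrative.

```python
def get_square(number):
    result = number ** 2        # "result" exists only inside the function
    return result

squared = get_square(4)
print(squared)

try:
    print(result)               # "result" is not defined at this level
except NameError:
    scope_error = True
    print('result is local to get_square')
```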

## LOGICAL FLOW STATEMENTS

### If-Then-Else Statements

    if <condition>:
        <result>
    else:
        <alternative result>
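
As a concrete instance of the template, here is a small sketch with an extra `elif` branch. `describe_grade` and its cut-off scores are hypothetical.

```python
def describe_grade(score):
    """Hypothetical helper: map a numeric score to a label."""
    if score >= 90:
        return 'excellent'
    elif score >= 75:           # checked only if the first condition is False
        return 'passing'
    else:
        return 'failing'

print(describe_grade(95), describe_grade(80), describe_grade(50))
```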

### While Loop

    while <condition is true>:
        
        <operation to be performed>
        
It is very important that the condition eventually becomes false.<br>
A condition that never becomes false leads to an infinite loop, and the cell will keep running until it is interrupted.
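
A minimal sketch of a loop that is guaranteed to stop, because the counter moves toward the exit condition on every pass. The variable names are illustrative.

```python
countdown = 3
visited = []

while countdown > 0:
    visited.append(countdown)
    countdown -= 1      # without this line the condition never becomes False

print(visited)
```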

### For Loop

    for <var_name> in <iterable object>:
        
        <operation to be performed for every iteration in the defined object>
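
Three common shapes of the for loop, sketched with made-up data: looping over a list, over `range()`, and over positions and items together with `enumerate()`.

```python
names = ['Ram', 'Rom', 'Rem']   # illustrative data

total_length = 0
for name in names:              # one pass per item in the list
    total_length += len(name)

squares = []
for i in range(5):              # i takes the values 0, 1, 2, 3, 4
    squares.append(i ** 2)

for position, name in enumerate(names):   # position starts at 0
    print(position, name)

print(total_length, squares)
```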


## NUMPY

NumPy is a powerful Python library for multi-dimensional array manipulation.

    import numpy as np

The basic data structure of numpy is an <b>array</b>.

    >>> np.array([1, 2, 3, 4, 5])       # a one-dimensional array / vector
    >>> np.array([[1,2,3],[4,5,6]])     # a 2-D array
    
You can index and slice a NumPy array much like a list, but with one index per dimension: <b>array[row index, col index]</b>.
    
    >>> A = np.array([[1,2,3],[4,5,6]])
    >>> A[0,0]
    
        1
        
    >>> A[1, 1]    
    
        5
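
Beyond single elements, whole rows, columns, and blocks can be selected with `:`. A short sketch using the same array; the result names are illustrative.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])

first_row = A[0, :]       # all columns of row 0
second_col = A[:, 1]      # all rows of column 1
sub_block = A[:, 1:]      # every row, columns 1 onward

print(first_row)
print(second_col)
print(sub_block)
```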
           
    

Unlike lists, arrays apply mathematical operations <b>component-wise</b>, similar to vectors and matrices in mathematics.

    >>> A = np.array([1,2,3])
    >>> A + 3
    
        array([4, 5, 6])
        
    >>> B = np.array([4,5,6])
    >>> A + B
        
        array([5, 7, 9])
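
The contrast with plain lists is easiest to see side by side. A minimal sketch with illustrative names:

```python
import numpy as np

list_a = [1, 2, 3]
list_b = [4, 5, 6]
array_a = np.array(list_a)
array_b = np.array(list_b)

print(list_a + list_b)      # lists: + concatenates
print(array_a + array_b)    # arrays: + adds element by element
print(list_a * 2)           # lists: * repeats the list
print(array_a * 2)          # arrays: * multiplies each element
```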

Attributes of arrays include:

    dtype | Types of the elements in an array
    ndim  | The number of axes (dimensions) of the array.
    shape | A tuple of integers indicating the size in each dimension.
    size  | The total number of elements in the array.
    
    You can read any attribute of a NumPy array by name, e.g. A.ndim (no parentheses, because it is an attribute, not a method).
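
For example, on a small 2-by-3 array of floats (an illustrative array, not one used elsewhere in these notes):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

print(A.dtype)    # element type
print(A.ndim)     # number of dimensions
print(A.shape)    # (rows, columns)
print(A.size)     # total number of elements
```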

NumPy also provides functions that create commonly used arrays:
    
    arange()  | An array of evenly spaced values within a given interval (like range()).
    linspace()| An array of evenly spaced values in a given interval where the number of elements is specified
    eye()     | A 2-D array with ones on the diagonal and zeros elsewhere.
    ones()    | A new array of given shape and type, filled with ones.
    zeros()   | A new array of given shape and type, filled with zeros.
    diag()    | Extract a diagonal or construct a diagonal array.
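
A sketch that calls each of the functions in the table once. The arguments are arbitrary examples.

```python
import numpy as np

evens = np.arange(0, 10, 2)      # start, stop (exclusive), step
points = np.linspace(0, 1, 5)    # 5 evenly spaced values, both ends included
identity = np.eye(3)             # 3x3 identity matrix
ones_block = np.ones((2, 3))     # shape is passed as a tuple
zeros_vec = np.zeros(4)
diagonal = np.diag([1, 2, 3])    # builds a matrix from a 1-D sequence

print(evens)
print(points)
print(diagonal)
```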


#### MASKING

Masking helps in finding the subset of values that meet a given condition. For example, 

    >>> A = np.array([-1, 3, -5, 6, -8])
    >>> mask = A>0
    >>> mask
    
        array([False,  True, False,  True, False])
        
    >>> A[mask]
        
        array([3, 6])
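
Masks can also be combined and counted. The sketch below reuses the same array. Note that conditions are joined with `&` / `|` and each one needs its own parentheses. The thresholds are arbitrary.

```python
import numpy as np

A = np.array([-1, 3, -5, 6, -8])

in_range = A[(A > -6) & (A < 5)]   # both conditions must hold
count_positive = (A > 0).sum()     # True counts as 1, False as 0
clipped = np.where(A > 0, A, 0)    # keep positives, replace the rest with 0

print(in_range, count_positive, clipped)
```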
           

#### Other Numpy Methods

    all()    | True if all elements evaluate to True.
    any()    | True if any elements evaluate to True.
    argmax() | Index of the maximum value.
    argmin() | Index of the minimum value.
    argsort()| Indices that would sort the array.
    max()    | The maximum element of the array.
    mean()   | The average value of the array.
    min()    | The minimum element of the array.
    sort()   | Return nothing; sort the array in-place.
    std()    | The standard deviation of the array.
    sum()    | The sum of the elements of the array.
    var()    | The variance of the array.
    
The following functions are available under np.random:
    
    choice() | Take random samples from a 1-D array.
    random() | Uniformly distributed floats over [0, 1).
    randint()| Random integers over a half-open interval.
    random_integers() | Random integers over a closed interval (deprecated; use randint() instead).
    randn()  | Sample from the standard normal distribution.
    permutation() | Randomly permute a sequence / generate a random sequence
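
A sketch that exercises a few entries from both tables. `scores` is made-up data, and the seed is fixed only so that repeated runs give the same draws.

```python
import numpy as np

scores = np.array([3, 9, 1, 7])      # hypothetical data

largest = scores.max()               # value of the largest element
where_largest = scores.argmax()      # its position in the array
order = scores.argsort()             # positions that would sort the array
average = scores.mean()

np.random.seed(0)                            # fix the seed for repeatability
dice = np.random.randint(1, 7, size=5)       # integers from 1 to 6 (7 is excluded)
picks = np.random.choice(scores, size=2)     # samples drawn from scores
shuffled = np.random.permutation(scores)     # shuffled copy; scores is unchanged

print(largest, where_largest, order, average)
print(dice, picks, shuffled)
```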

## PLOTTING

    from matplotlib import pyplot as plt
    import numpy as np


#### LINE PLOT

    >>> x = np.linspace(-100, 100, 201)
    >>> y = x**2

    >>> plt.plot(x, y)
    >>> plt.show()

#### SCATTER PLOT

    >>> x = np.random.normal(scale = 1, loc = 1, size = 1000)
    >>> y = np.random.normal(scale = 2, loc = 10, size = 1000)

    >>> plt.scatter(x, y, color = 'red')
    >>> plt.title("My Scatter Plot")
    >>> plt.ylabel('y_label')
    >>> plt.xlabel('x_label')
    >>> plt.show()

#### HISTOGRAM

    >>> plt.hist(x, bins = 50, color = '#3cb371')
    >>> plt.title('Histogram')
    >>> plt.show()

#### BOXPLOT

    >>> plt.boxplot(np.hstack((x.reshape(-1,1),y.reshape(-1,1))))
    >>> plt.title('Boxplot of X & Y')
    >>> plt.xticks(np.arange(1,3), ['N(1, 1)','N(10,2)'])
    >>> plt.show()

#### CORRELATION PLOT

    >>> import seaborn as sns
    
    >>> z = x+y/2
    >>> cor = np.corrcoef(np.array([x, y, z]))
    >>> sns.heatmap(cor)

    >>> plt.title('Heatmap')
    >>> plt.xticks(np.arange(3)+0.5, ['x','y','x+y/2'])
    >>> plt.yticks(np.arange(3)+0.5, ['x','y','x+y/2'])
    >>> plt.show()


#### USING SUBPLOTS

    >>> x = np.linspace(0, 2000, 2001)
    >>> z = np.random.standard_t(1, len(x))
    >>> y = np.random.normal(0, 1, len(x))

    #UPPER LEFT PLOT - TIME SERIES PLOT FOR Y & Z
    >>> plt.subplot(221)
    >>> plt.plot(x, z, '#CB4335', label = 't(1)')
    >>> plt.plot(x, y, '#148F77', label = 'N(0,1)')
    >>> plt.title('Time Series Plot')
    >>> plt.legend(loc = (1,1.25), fontsize = 10)

    #UPPER RIGHT PLOT - BOXPLOTS FOR Y & Z
    >>> plt.subplot(222)
    >>> comb = np.hstack((y.reshape(-1, 1), z.reshape(-1,1)))
    >>> plt.boxplot(comb)
    >>> plt.xticks(np.arange(1,3), ['N(0,1)', 't(1)'])
    >>> plt.title('Boxplots')

    #LOWER LEFT PLOT - HISTOGRAM FOR Y
    >>> plt.subplot(223)
    >>> plt.hist(y, color = '#148F77')
    >>> plt.title('Distribution of N(0,1)')

    #LOWER RIGHT PLOT - HISTOGRAM FOR Z
    >>> plt.subplot(224)
    >>> plt.hist(z, color = '#CB4335', bins = 20)
    >>> plt.title('Distribution of t(1)')

    >>> plt.suptitle('Comparison of N(0,1), t(1)', x= 0.53,y = 1.2, fontsize = 15)

    >>> plt.tight_layout()
    >>> plt.show()

## PANDAS

    import pandas as pd
    
Pandas is one of the most widely used DataFrame libraries in Python. It is especially useful for data analysis and manipulation.

#### SERIES

The most basic data structure in pandas is the Series, which is similar to a one-dimensional array, except that each element is paired with an index label and the elements may be of different types.

    >>> s1 = pd.Series(np.arange(9, -1, -1))
    >>> s1.values
    >>> s1.index
    
We can customize the index of a series as desired.

    >>> s1 = pd.Series(np.arange(9, -1, -1), index = ['a','b','c','d','e','f','g','h','i','j'])
    
Or use dictionaries to create a series.

    >>> ages = {'Ram': 31, 'Rom': 28, 'Rem': 34, 'Rum': 26}
    >>> pd.Series(ages)


#### DATAFRAME

A DataFrame is a two-dimensional table with rows and columns; each column is a Series.

We can create a DataFrame from several dictionaries (or Series objects) that share the same keys:

    >>> ages = {'Ram': 31, 'Rom': 28, 'Rem': 34, 'Rum': 26}
    >>> grades = {'Ram': 'B', 'Rom': 'A', 'Rem': 'C+', 'Rum': 'F'}
    >>> pd.DataFrame({'Grades':grades, 'Ages': ages})

Most of the time, we will be importing our data from an external file -- like a csv.

    >>> df_titanic = pd.read_csv('Titanic/Titanic.csv')
    >>> df_titanic.head()

To get a quick view of our data, we can use the following method.

    >>> df_titanic.dtypes #Get data types per column
    >>> df_titanic.describe() #Get an idea of distribution of values

We can check for missing values by using the following method

    >>> null_vals = df_titanic.isnull()
    >>> null_vals.sum(axis = 0)
    
Note that the isnull() method transforms the entire data set into True or False, with True representing the null values. We can then sum that per column to get the number of null values per column.

To deal with missing values, we can opt to drop every row that contains one. Note that dropna() returns a new DataFrame and leaves the original unchanged.

    >>> df_titanic.dropna()
    
Note, however, that this is not always the best option: deleting rows can discard valuable information, and in some data sets a missing value really stands for 0. If we want to replace the null values instead, we can use,

    >>> df_titanic.fillna(0)
    
As an exercise, try to fill the missing values of AGE with the mean age. For cabin, there are too many missing values with no distinguishable way to reconstruct them. As such, we can opt to just drop the column completely as follows:

    >>> df_titanic.drop('Cabin', axis =1)
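
The missing-value tools above can be tried without the Titanic file. The sketch below uses a tiny made-up DataFrame (the values are hypothetical) and shows that `dropna()` and `fillna()` return new objects instead of changing the original.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real file
df = pd.DataFrame({
    'Age': [22.0, np.nan, 30.0, np.nan],
    'Fare': [7.25, 71.28, np.nan, 8.05],
})

missing_per_column = df.isnull().sum(axis=0)   # count of nulls in each column
dropped = df.dropna()                          # rows with any null removed
filled = df.fillna({'Fare': 0})                # fill only the Fare column

print(missing_per_column)
print(len(df), len(dropped))
print(filled)
```

To keep a result, assign it to a name, for example `df = df.dropna()`.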

Positional indexing such as df[0, 2] does not work on a DataFrame directly. Instead, columns are selected by name and rows are filtered with boolean conditions, much like a WHERE clause in SQL.

    Return only the Survived column.
    >>> df_titanic['Survived']
    >>> df_titanic.Survived
    
    Return the passengers who survived
    >>> df_titanic[df_titanic.Survived == 1]
    
    Return the passengers who survived and are male
    >>> df_titanic[(df_titanic.Survived ==1) & (df_titanic.Sex == 'male')]
    
    Return the passengers who survived and are male, but only return the Names and Class
    >>> df_titanic[(df_titanic.Survived ==1) & (df_titanic.Sex == 'male')][['Name', 'Pclass']]

To get a subset by integer position, as with traditional slicing, we can use the .iloc indexer.

    >>> df_titanic.iloc[:,2]
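
A short sketch contrasting position-based `.iloc` with label-based `.loc`, using a small made-up DataFrame built from the names used earlier in these notes.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame(
    {'Age': [31, 28, 34, 26], 'Grade': ['B', 'A', 'C+', 'F']},
    index=['Ram', 'Rom', 'Rem', 'Rum'],
)

first_two_rows = df.iloc[:2]          # rows at positions 0 and 1
second_column = df.iloc[:, 1]         # every row, column at position 1
by_label = df.loc['Rem', 'Age']       # row label and column label
label_slice = df.loc['Ram':'Rem']     # label slices include the end label

print(first_two_rows)
print(second_column.tolist(), by_label, len(label_slice))
```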

When cleaning data, we sometimes want to transform an entire column, for example to recode its values into a more appropriate form. The .apply() method of a Series (a single DataFrame column) allows us to do this.

    >>> def convertGen(x):
    ...     if x == 'male':
    ...         return 1
    ...     else:
    ...         return 0

    >>> df_titanic['Sex'].apply(convertGen)
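
For a simple recoding like this one there are a few equivalent routes. The sketch below compares them on a made-up Series (hypothetical values); all three produce the same 0/1 column.

```python
import pandas as pd

sex = pd.Series(['male', 'female', 'female', 'male'])   # hypothetical data

def convert_gen(value):
    return 1 if value == 'male' else 0

encoded_apply = sex.apply(convert_gen)              # call a function per element
encoded_map = sex.map({'male': 1, 'female': 0})     # look each value up in a dict
encoded_vector = (sex == 'male').astype(int)        # compare, then cast True/False

print(encoded_apply.tolist())
```

As with the other pandas methods, none of these change `sex` itself; assign the result to a column to keep it.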
    
As an exercise, try to write a function that converts Name to Titles and apply it to the dataframe.

#### PLOTTING WITH PANDAS

While you can use the traditional methods to plot, pandas has a quick way of plotting as well.

    >>> df_titanic.hist('Age')
    >>> plt.show()

    >>> df_titanic.plot('Age', 'Fare', kind = 'scatter')
    >>> plt.show()

Before plotting, we sometimes have to do some transformation to our data in order for it to make sense.
For example, to compare the number of survivals, we can use this:
    
    >>> SurvCount = df_titanic.Survived.value_counts()
    >>> SurvCount.plot(kind='bar')
    >>> plt.show()
    
.value_counts() is useful in general especially if we want to find the frequency of each unique value in a column.

We can also use pandas to aggregate some of our data of interest. 

    >>> SurvSex = pd.crosstab(df_titanic.Sex, df_titanic.Survived)
    >>> SurvSex.plot(kind = 'bar', stacked = True)
    >>> plt.show()
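
To see what `value_counts()` and `pd.crosstab()` return before plotting them, here is a sketch on a small made-up table (hypothetical values, not the Titanic data).

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female'],
    'Survived': [0, 1, 1, 1, 0],
})

surv_count = df.Survived.value_counts()       # frequency of each unique value
table = pd.crosstab(df.Sex, df.Survived)      # rows: Sex, columns: Survived

print(surv_count)
print(table)
```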

We can also group continuous data into categorical bins by using the pd.cut function.

    >>> Age_cat = pd.cut(df_titanic.Age.dropna(), 3)
    >>> Age_cat.value_counts().plot(kind='bar')
    >>> plt.show()
    
We can also specify the bin edges ourselves. Values that fall outside the edges become NaN.

    >>> Age_cat = pd.cut(df_titanic.Age.dropna(), [0, 8, 20, 40, 60])
    >>> Age_cat.value_counts().plot(kind='bar')
    >>> plt.show()
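
Bins can also be given readable names through the `labels` argument of `pd.cut`. In the sketch below the ages and the label names are made up. The last age falls outside the bin edges and therefore becomes NaN.

```python
import pandas as pd

ages = pd.Series([5, 15, 30, 50, 70])            # hypothetical ages
bins = [0, 8, 20, 40, 60]                        # same edges as above
labels = ['child', 'teen', 'adult', 'senior']    # hypothetical labels, one per bin

age_cat = pd.cut(ages, bins=bins, labels=labels)
counts = age_cat.value_counts()                  # NaN entries are not counted

print(age_cat)
print(counts)
```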