

<h1> Data Analysis with Python - Lab I </h1>
For more data science and statistics, check out <a href= "https://cognitiveclass.ai/courses/data-analysis-python">Data Analysis with Python</a> for free! 


<hr>

### Table of Contents 
[**NumPy Library**](#Numpy-Lib)   
[***pandas* Library**](#Pandas-Lib)  
[**CO2 emissions by Country by Year**](#CO2-emissions-by-Country-by-Year)  
[**Get the Data**](#Get-the-Data)  
[**Import the data using *pandas***](#Import-the-data-using-Pandas)  
[**Dataframe characteristics**](#Dataframe-characteristics)  
[**Subsetting the Dataframe**](#Subsetting-the-Dataframe)  
[**Conditional Subsetting**](#Conditional-Subsetting)  





<hr>

<h1>NumPy Library</h1>

__NumPy__: 
- Fast
- Multidimensional Arrays 
- Vectorized Computation

### NumPy ndarray
An N-dimensional array object that lets you perform mathematical operations on whole blocks of data at once. 

In [None]:
# to leverage a library, use "import"
import numpy as np

#### Creating ndarrays
- Use the __array__ function.  
- All of the elements must be the __same type__ (homogeneous)

In [None]:
# create an nd array
data=np.array([[ 1.9526, -0.246 , -0.8856],
[ 0.5639, 0.2379, 0.9104]])

# display the array
data
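The "same type" rule means NumPy converts mixed input to one common type. The short sketch below (variable names are illustrative) shows what that looks like in practice:

```python
import numpy as np

# Mixing ints and floats: NumPy upcasts every element to a common type
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)          # float64

# All-integer input stays an integer type (exact width depends on the platform)
ints = np.array([1, 2, 3])
print(ints.dtype)

# Mixing numbers and strings: every element becomes a string
as_text = np.array([1, "a"])
print(as_text.dtype.kind)   # 'U' (unicode string)
```
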

Every array has a __number of dimensions__ and a __shape__ (a tuple indicating the size of each dimension).

In [None]:
# get number of dimensions of array
data.ndim

In [None]:
# get shape of array (size of each dimension)
data.shape

__dtype:__ an object describing the data type of the array:


In [None]:
# get type of data in array
data.dtype

__empty__, __arange__, __zeros__ and __ones__ arrays

In [None]:
data0 = np.zeros(10)
data0

In [None]:
data1 = np.empty((3, 2))
data1

In [None]:
data2 = np.arange(10)
data2

### <span style="color: red">YOUR TURN:</span> 
create a 2x4 array of ones and find the data type and shape of the array

In [None]:
# your code


#### Data Types

In [None]:
# np.string_ was removed in NumPy 2.0; np.bytes_ is the equivalent name
data = np.array([1.25, -9.6, 42], dtype=np.bytes_)
data

Cast to floating point with `astype`:

In [None]:
data.astype(np.float32)

### Operations

In [None]:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr

Arithmetic operations

In [None]:
arr*2

In [None]:
arr*arr
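Note that `*` between two arrays multiplies element by element; it is not a matrix product. As a small sketch of the difference, the example below compares it with the `@` operator, which is not used elsewhere in this lab:

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])

elementwise = arr * arr      # each element times itself, shape (2, 3)
matrix_prod = arr @ arr.T    # matrix product of (2, 3) and (3, 2), shape (2, 2)

print(elementwise)
print(matrix_prod)
```
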

### Indexing and Slicing
Use indexing and slicing to select a subset of your data or individual elements.

In [None]:
arr = np.arange(10)
arr

Same as Python lists:

In [None]:
arr[5]

In [None]:
arr[5:8]

You can assign a scalar value to a slice; the value is applied to every element in the selection:

In [None]:
arr[5:8] = 12
arr

In [None]:
arr[5:]
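One difference from Python lists is worth knowing here: a NumPy slice is a view of the original array, not a copy. A minimal sketch, with illustrative variable names:

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]                 # a view: shares memory with arr
view[0] = 99
print(arr[5])                   # 99, the original changed too

independent = arr[5:8].copy()   # an explicit copy
independent[1] = -1
print(arr[6])                   # 6, the original is untouched
```

Call `.copy()` whenever you need a slice you can modify without touching the source array.
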

In a two-dimensional array, the elements at each index are no longer scalars but one-dimensional arrays:

In [None]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d

In [None]:
arr2d[2]

In [None]:
arr2d[0][2]   # also: arr2d[0, 2]

In [None]:
arr2d[2:]

In [None]:
arr2d[:2, 1:]

### Boolean Indexing

In [None]:
data=np.random.rand(7, 4)
data

In [None]:
data > 0.4    # comparison operators: ==, !=, <, >, <=, >=; combine conditions with & and |

In [None]:
data[(data > 0.4)]
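Several conditions can be combined into one mask with `&` (and) and `|` (or). The sketch below uses a seeded random generator so that it is reproducible; the seed and the thresholds are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded so the result is reproducible
data = rng.random((7, 4))

# Each comparison needs its own parentheses because & and | bind tighter than > and <
in_range = (data > 0.4) & (data < 0.8)
outside = (data <= 0.4) | (data >= 0.8)

print(data[in_range])    # 1-D array of the values between 0.4 and 0.8
print(in_range.sum())    # number of True entries in the mask
```
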

### Functions

In [None]:
arr = np.arange(5)
np.sqrt(arr)   # square root function

In [None]:
np.exp(arr)   # exponential function (e**x)

In [None]:
np.mean(arr) # average function
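On a multidimensional array, element-wise functions such as `np.sqrt` keep the shape, while aggregating functions such as `np.mean` accept an `axis` argument. A short sketch on a small 2x3 array:

```python
import numpy as np

arr2d = np.array([[1., 2., 3.], [4., 5., 6.]])

print(np.sqrt(arr2d).shape)     # (2, 3): applied to every element
print(np.mean(arr2d))           # 3.5: one number for the whole array
print(np.mean(arr2d, axis=0))   # [2.5 3.5 4.5]: mean of each column
print(np.mean(arr2d, axis=1))   # [2. 5.]: mean of each row
```
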

### <span style="color: red">YOUR TURN:</span> 

Create a 3x2 numpy array (use __ones__ function), and select its second row

In [None]:
## YOUR CODE BELOW




<h1><em>pandas</em> Library</h1>

*pandas*
- for most kinds of data analysis
- for structured or tabular data
- high-level data structure
- built on top of NumPy  
- time series manipulation


Objects:  
- Series  
- Dataframe  



First, import the *pandas* package

In [None]:
import pandas as pd

#### Series

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index. 

In [None]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

In [None]:
obj2[['c', 'a', 'd']]

In [None]:
'b' in obj2

### Dataframe
- represents a tabular, spreadsheet-like data structure 
- each column can be a different value type (numeric, string, boolean, etc.)
- has both a row and column index
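As a minimal sketch of these three points, a Dataframe can be built from a dictionary of columns. The column names and values below are invented for illustration:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
frame = pd.DataFrame({
    "name": ["a", "b", "c"],        # strings
    "value": [1.5, 2.0, 3.25],      # floats
    "flag": [True, False, True],    # booleans
})

print(frame.dtypes)    # one dtype per column
print(frame.index)     # row index: 0, 1, 2 by default
print(frame.columns)   # column index
```
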

<h1>CO2 emissions by Country by Year</h1>

Carbon dioxide emissions are those stemming from the burning of fossil fuels and the manufacture of cement. They include carbon dioxide produced during consumption of solid, liquid, and gas fuels and gas flaring.

http://data.worldbank.org/indicator/EN.ATM.CO2E.PC/

<h2>Import the data using Pandas</h2>

#### Import required `pandas` library

In [None]:
import pandas as pd

#### Import data using `pd.read_excel`

In [None]:
data = pd.read_excel("co2_data.xlsx", skiprows = 4)
print('Data read into a pandas dataframe!')
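The `skiprows = 4` argument tells pandas to ignore the first four rows of the file, presumably because the spreadsheet starts with a few lines of metadata before the real header row. The sketch below shows the same option on an in-memory CSV with `pd.read_csv`, which avoids needing the Excel file. The file contents are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical file contents: four preamble lines, then the real header row
raw = io.StringIO(
    "Data Source,Example\n"
    "Last Updated,unknown\n"
    "Note,made-up values\n"
    "Note,for illustration only\n"
    "Country Name,Country Code,2010\n"
    "Aland,AAA,1.0\n"
    "Borduria,BBB,2.5\n"
)

toy = pd.read_csv(raw, skiprows=4)   # skip the four preamble lines
print(toy.columns.tolist())
print(toy.shape)
```
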

In [None]:
type(data) # checks the type of object

In [None]:
data

#### Display first 5 rows of `data` using `head`

In [None]:
data.head()

#### Take just the first two rows using `head`

In [None]:
data.head(2)

In [None]:
data.head(n=1)

### Don't remember what the parameters are for a function? Use `?`

In [None]:
# to close help window, press q
data.head?

#### Display last 3 rows of `data` using `tail`

In [None]:
data.tail(3)

<hr>

<h2>Dataframe characteristics</h2>

#### How many rows and columns are there?

In [None]:
data.shape # (rows, columns)

In [None]:
data.describe()

#### What are the column names?

In [None]:
data.columns

#### What is the first column name?

In [None]:
data.columns[0]

#### What are the first and second column names?

In [None]:
data.columns[[0,1]]

<hr>

## Review Questions

### <span style="color: red">YOUR TURN:</span> 

#### Print the first 3 rows of `data`

In [None]:
## YOUR CODE BELOW



#### Print the names of the first and last columns

In [None]:
## YOUR CODE BELOW


<hr>

<h2>Subsetting the Dataframe</h2>

### Select columns

#### Select columns by name:

In [None]:
data['Country Name']

#### Select columns by column number:

In [None]:
data[data.columns[0]]

#### Select the last column:

In [None]:
data[data.columns[-1]]

#### Subset Multiple Columns by Name

In [None]:
data[['Country Name', 'Country Code']]

### Select rows

A few different ways:

In [None]:
data[0:2] # first and second row

In [None]:
data[:2] 

In [None]:
data.iloc[[0,2]] # first and third rows

In [None]:
data[0:2]['Country Name']

In [None]:
data.iloc[0:3, 1:4] # first three rows, second through fourth columns (.ix was removed in pandas 1.0)

#### Select the last row

In [None]:
data.iloc[-1]
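`iloc` selects by integer position. Its counterpart `loc` selects by index label, and the two only coincide when the index happens to be 0, 1, 2, and so on. A minimal sketch on a toy Dataframe whose values and index labels are made up:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Country Name": ["A", "B", "C"], "2010": [1.0, 2.0, 3.0]},
    index=[10, 20, 30],   # labels deliberately differ from positions
)

by_position = toy.iloc[0]   # first row, whatever its label
by_label = toy.loc[10]      # row whose index label is 10
last_two = toy.iloc[-2:]    # last two rows by position

print(by_position["Country Name"])
print(last_two.index.tolist())
```
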

<h2 align="center">Conditional Subsetting</h2>

Recall: there are various logical operators to create logical statements.

In [None]:
1 == 2

In [None]:
"Me" != "You"

In [None]:
1000 > 1

When you apply a logical statement to an array, an array of True/False values is returned, one for each element.

In [None]:
import numpy as np
my_range = np.array(range(1, 100))
my_range

In [None]:
result = my_range > 50
result

Then you can use the array of True/False values to return *only the elements where the condition is true* from the original array.

In [None]:
my_range[result]

#### Select rows based on a condition

In [None]:
data['Country Name']

In [None]:
# True where the country name is Albania, False everywhere else
data['Country Name'] == 'Albania'

In [None]:
data[data['Country Code'] == 'CHN'] # subset based on condition

#### Select rows based on multiple conditions

In [None]:
data[(data['Country Name'] == 'China') & (data['Country Code'] == 'CHN')]

#### Why does the following return no hits?

In [None]:
data[(data['Country Name'] == 'Canada') & (data['Country Code'] == 'CHN')]

### Select data by row and column

What does the following do?
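A common pattern for this is `.loc[rows, columns]`, with a boolean mask for the rows and a list of names for the columns. The sketch below is an assumption about what such a cell might look like. It uses a small table of made-up values, not the World Bank data:

```python
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({
    "Country Name": ["A", "B", "C"],
    "Country Code": ["AAA", "BBB", "CCC"],
    "2010": [1.0, 2.0, 3.0],
    "2011": [1.1, 2.1, 3.1],
})

rows = toy["Country Code"] == "BBB"   # boolean mask over the rows
cols = ["Country Name", "2010"]       # list of column labels

subset = toy.loc[rows, cols]          # rows where the mask is True, only the listed columns
print(subset)
```
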

<hr>

## Review Question

### <span style="color: red">YOUR TURN:</span> 
<h3>Find the CO2 Emission per capita for France and Germany in 2010 and 2011</h3>

In [None]:
# hint
# data['Country Name']== ??
# data[['Country Name', '2010', '2011']].iloc[[??,??]]

In [None]:
## YOUR CODE BELOW



<hr>
Copyright &copy; 2018 [Cognitive Class](https://cognitiveclass.ai/). This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license/).