# Week 1: Installing and Introduction to Python's Data Mining Libraries

### What's on this week
1. [Why Python?](#whypython)
2. [Installing Anaconda](#install)
3. [Process flow of predictive mining in Python](#processflow)
4. [Interactive prototyping in ipython](#ipython)
5. [Defining problem and purpose of data mining](#purpose)
---

### Important Changelog:
* (25/07/2017) Made tutorial notes public.
* (26/07/2017) Small changes on instructions to run IPython on Windows.
* (02/11/2017) Large updates. Set notes to beta.

This week's practical note introduces you to Python and its common machine learning libraries. Python is a high-level, interpreted programming language used for a wide range of purposes, from web servers to scientific computing. Its syntax emphasizes readability, which allows newcomers to learn and use it quickly.

The practical sessions in this unit cover the use of Python for data mining and machine learning. We **WILL NOT** cover the basics of Python. Fortunately, there are many resources for learning Python from scratch, and you can reasonably learn the basics in a week.

We will use Python 3 in this unit. All examples are written using Python 3.5.2, but any version of Python 3 above 3.4 should work just fine. 

**These tutorial notes are in beta. Please send us feedback and suggestions on how to improve them, and ask your tutor if you have any questions or need clarification.**



## 1. Why Python? <a name="whypython"></a>
In data mining and machine learning, Python is arguably one of the fastest-growing and most widely used programming languages, alongside R and Julia. There are a number of reasons for this:

### 1.1. Interpreted language
Python is designed as an interpreted language, which allows users to test and prototype models quickly.

### 1.2. Open-source
Python is free and has no ties to any proprietary or corporate technologies, which makes it a top choice for students, academics and startups.

### 1.3. Wide, cutting edge support for almost anything

Python has a vast range of actively updated libraries for almost every data mining task:

* **pandas** for data wrangling and preprocessing ([link](http://pandas.pydata.org/))
* **scikit-learn** for supervised and unsupervised learning ([link](http://scikit-learn.org/stable/))
* **numpy** for matrix manipulation ([link](http://www.numpy.org/))
* **seaborn** and **matplotlib** for visualization ([link](https://seaborn.pydata.org/)) ([link2](https://matplotlib.org/))
* **ipython** for interactive prototyping ([link](https://ipython.org/))
* **jupyter notebook** for interactive, web-based prototyping ([link](http://jupyter.org/))


### 1.4. Production ready
Models and pipelines built with Python are well suited to deployment in production systems.

## 2. Installing Anaconda <a name="install"></a>

Anaconda is a data science distribution of Python. It bundles many essential data science libraries and aims to simplify the installation process. All libraries mentioned above are included in the Anaconda distribution except for Seaborn.

### 2.1. For Windows/Mac Users

For Windows users, simply [download Anaconda](https://www.anaconda.com/download/) and install it. Choose the latest Python 3 version for Windows. Once it is installed, go to Start > Anaconda3 > Anaconda Prompt and type `conda install seaborn` to install Seaborn.


### 2.2. For Linux Users

For Linux users, go to the [Anaconda website](https://www.anaconda.com/download/) to get the installation file. Choose the latest Python 3 version. It should download an `.sh` file. Once the download has finished, open your terminal and give the file execute permission with `chmod +x [path_to_downloaded_file]`. Then run it to start the installer, for example with `./[downloaded_file]` from the directory that contains it.
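
After installation, it can be useful to confirm that the libraries from section 1.3 are importable. The following is a minimal sketch, not part of the Anaconda tooling: the helper name `check_libraries` is hypothetical, and it assumes each library is checked under its import name (for example, scikit-learn is imported as `sklearn`).

```python
import importlib

# Import names of the libraries listed in section 1.3.
# Note: scikit-learn is imported as "sklearn".
LIBRARIES = ["pandas", "sklearn", "numpy", "seaborn", "matplotlib", "IPython"]


def check_libraries(names):
    """Return a dict mapping each import name to its version string,
    or None if the library cannot be imported.

    Hypothetical helper written for this note.
    """
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            # Most libraries expose __version__; fall back to "unknown".
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    for name, version in check_libraries(LIBRARIES).items():
        print("{:12s} {}".format(name, version or "NOT INSTALLED"))
```

Any library reported as `NOT INSTALLED` can usually be added from the Anaconda Prompt or terminal with `conda install`, as shown for Seaborn above.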

## 3. Process flow for predictive mining using Python<a name="processflow"></a>
![Predictive mining process flow in Python](https://s3-ap-southeast-2.amazonaws.com/dataminingtuts/process_flow_python.png  "Predictive mining process flow in Python")

The diagram above presents the steps we will take in this unit to perform predictive mining on a dataset. The first and most important step is to define the problem and purpose of the data mining. You need to ask questions such as:

* What kind of data do we have?
* Why are we performing predictive mining on this data?
* What information are we trying to predict?
* How could the stakeholders (including yourself) use the insights we gained from the data mining?

Once we understand the problem and purpose of the data mining process, the next step is to explore the data. In this step, we try to understand patterns and distributions in the data. We should also identify problems in the dataset, such as noise and missing values, to be cleaned in the following step. Both steps will be performed mainly using ```pandas``` with some help from ```sklearn```'s preprocessing modules.

Once the data is clean, it can be used to build predictive models. There are many algorithms available in ```sklearn```, each with its own characteristics. We will explore one algorithm at a time in the upcoming weeks.

In all stages, we also need to visualize the patterns and trends found in the data. Visualization allows us to understand the data better. In this unit, all visualizations will be done using ```seaborn``` and ```matplotlib``` with data presented by ```pandas``` dataframes.
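
To make the flow described above concrete, here is a minimal sketch of the explore, clean and model steps. It runs on a small, made-up dataset rather than the unit's data. The column names (`age`, `income`, `target`) and the choice of a decision tree are assumptions made purely for illustration. Visualization is left out to keep the sketch short.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two input columns and a binary target.
rng = np.random.RandomState(42)
df = pd.DataFrame({
    "age": rng.randint(18, 80, size=200).astype(float),
    "income": rng.normal(50000, 15000, size=200),
})
df["target"] = (df["income"] > 50000).astype(int)
df.loc[::10, "age"] = np.nan  # inject missing values on purpose

# Explore: summary statistics and missing values per column.
print(df.describe())
missing_counts = df.isnull().sum()
print(missing_counts)

# Clean: fill missing ages with the median age.
df["age"] = df["age"].fillna(df["age"].median())

# Model: hold out 30% of the rows for testing, then fit a small tree.
X = df[["age", "income"]]
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out rows.
accuracy = model.score(X_test, y_test)
print("Test accuracy:", accuracy)
```

Each of these steps is covered in more depth in later weeks; the point here is only how ```pandas``` and ```sklearn``` hand data to each other.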


## 4. Interactive prototyping with ipython<a name="ipython"></a>

```ipython``` is an interactive Python shell designed for fast prototyping. In data mining/machine learning, many engineers use ipython to quickly review the data and process they are working on.

### If you are using Anaconda

To start the ipython console, go to Start > Anaconda3 > IPython. By default, it starts in your Documents folder. If you wish to save your projects in another directory, change the current directory using `cd "your directory path"`.

![Starting IPython from Anaconda in Windows](http://dataminingtuts.s3.amazonaws.com/anaconda_ipython.png "Starting IPython from Anaconda in Windows")

### If you are using Linux/Unix and installed the libraries manually

We can call ipython the same way as we call the python interpreter itself:

```bash
ipython
```

```bash
# Output
Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
Type 'copyright', 'credits' or 'license' for more information
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: 
```

All examples in this unit are shown using the ipython console.

## 5. Defining problem and purpose of data mining process<a name="purpose"></a>

**Scenario:** A national veterans organisation is seeking to improve the targeting of its donation solicitations. By only soliciting the most likely donors, less money will be spent on solicitation efforts and more money will be available for charitable concerns. Of particular interest is the class of individuals identified as lapsing donors. The organisation has run a greeting card mailing campaign called **PVA97NK**, and it now seeks to classify its lapsing donors based on their responses to this campaign. With this classification, a decision can be made to either solicit or ignore a lapsing individual in next year's campaign.

Now it is up to you, as a data science professional employed by this organisation, to use this dataset to improve their solicitation effort.

The `PVA97NK` dataset is available in the `datasets/pva97nk.csv` file. Let's start answering the essential questions above by exploring the data. Import the dataset into our `ipython` console with `pandas`.

```python
# In [1]:
import pandas as pd
df = pd.read_csv('datasets/pva97nk.csv')
```

Once the dataset is imported, we can start by looking at the columns/variables available. We can use the `.info()` method for this purpose.

> **pandas.DataFrame.info()** provides a concise summary of a DataFrame, such as the number of entries (rows), the data columns with their respective data types, and memory usage.

```python
# In [2]:
df.info()
```

```bash
# Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9686 entries, 0 to 9685
Data columns (total 28 columns):
TargetB             9686 non-null int64
ID                  9686 non-null int64
TargetD             4843 non-null float64
GiftCnt36           9686 non-null int64
GiftCntAll          9686 non-null int64
GiftCntCard36       9686 non-null int64
GiftCntCardAll      9686 non-null int64
GiftAvgLast         9686 non-null float64
GiftAvg36           9686 non-null float64
GiftAvgAll          9686 non-null float64
GiftAvgCard36       7906 non-null float64
GiftTimeLast        9686 non-null int64
GiftTimeFirst       9686 non-null int64
PromCnt12           9686 non-null int64
PromCnt36           9686 non-null int64
PromCntAll          9686 non-null int64
PromCntCard12       9686 non-null int64
PromCntCard36       9686 non-null int64
PromCntCardAll      9686 non-null int64
StatusCat96NK       9686 non-null object
StatusCatStarAll    9686 non-null int64
DemCluster          9686 non-null int64
De
```
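
In the `.info()` output, a non-null count lower than the number of entries (as for `TargetD` and `GiftAvgCard36`) indicates missing values. As a minimal sketch, missing values can also be counted directly with `isnull().sum()`. The small DataFrame below is a made-up stand-in that only borrows three column names from the real dataset; its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the real dataset; values are made up.
toy = pd.DataFrame({
    "TargetB": [0, 1, 1, 0],
    "TargetD": [np.nan, 10.0, 25.0, np.nan],
    "GiftAvgCard36": [12.5, np.nan, 8.0, 15.0],
})

# Number of missing values in each column.
missing = toy.isnull().sum()
print(missing)

# Keep only the columns that have at least one missing value.
print(missing[missing > 0])
```

The same two calls can be applied to `df` once the real dataset is loaded.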

The PVA97NK dataset contains 28 columns/variables, including an ID, member demographics, donation history, etc. There are two possible target variables that we could predict:
1. **TargetB**: a binary value indicating whether a person is a lapsing donor or not.
2. **TargetD**: an interval value for the amount donated in response to the mailing campaign.

With this information, we can now answer the questions listed in section 3:
* **What kind of data do we have?**: 28 variables with various information about the donors.
* **Why are we performing predictive mining on this data?**: We would like to find possible lapsing donors to improve our donation solicitation campaign.
* **What information are we trying to predict?**: Whether a person is a possible lapsing donor or not, corresponding to **TargetB**.
* **How could the stakeholders (including yourself) use the insights we gained from the data mining?**:
    1. Improve the accuracy of the solicitation campaign, which results in a higher response rate and less wasted effort.
    2. Find underlying characteristics of lapsing donors, leading to a better understanding of what makes donors return.
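
Before modelling, it usually helps to look at how the target variables are distributed. The sketch below shows one way to do this with `value_counts()` and `describe()`. It uses a made-up six-row DataFrame with hypothetical values, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; only the column names come from the dataset.
toy = pd.DataFrame({
    "TargetB": [0, 1, 1, 0, 0, 1],
    "TargetD": [np.nan, 10.0, 25.0, np.nan, np.nan, 5.0],
})

# Class balance of the binary target, as counts and as proportions.
counts = toy["TargetB"].value_counts()
proportions = toy["TargetB"].value_counts(normalize=True)
print(counts)
print(proportions)

# Summary statistics of the interval target; missing values are skipped.
print(toy["TargetD"].describe())
```

Replacing `toy` with `df` gives the same summaries for the real data.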

It looks like we have an interesting and useful data mining project on our hands. :)

## End notes and next week
This week, we learned how to install Python and its libraries with Anaconda. We also learned about the typical data mining process flow in Python and explored a bit of the dataset to understand why we are performing data mining on it.

Next week, we will focus on exploring trends and performing data cleaning/preprocessing on the PVA97NK dataset.