## What is data science?

In [None]:
# Both helpers are available from IPython.display; IPython.core.display is deprecated for imports
from IPython.display import HTML, Image

The term 'data science' has recently become a buzzword. We hear that data scientists and data analysts are highly sought after and that the field is becoming a popular branch of IT.

But what exactly is data science, and what skills are necessary to become someone who analyzes data?



Data science is an interdisciplinary field, which means it sits at the intersection of multiple disciplines. At the centre of it is data - the most valuable commodity of today.

Data science seeks to analyze and process data in such a way that new knowledge can be extracted from it. This knowledge serves to improve businesses and advance science by supporting the decision-making process (what should we do?), making predictions (what will happen next?) and discovering patterns (finding hidden information in the data).

In [None]:
Image(url= "img/DS.png", width=400, height=1200)

source: https://jakevdp.github.io/PythonDataScienceHandbook/00.00-preface.html

This diagram shows where data science lies at the intersection of different disciplines. It combines the skills of statistics (knowing how to summarize datasets), computer science (knowing how to design and implement algorithms) and domain expertise (necessary to formulate the research questions and put the answers into context).

### Where is Data Science needed?

It's used in many disciplines, such as science, banking, consulting, health care, e-commerce, politics, and manufacturing.


Some areas where data science is used:
* marketing (sales analysis)
* election predictions
* to predict the best time for deliveries
* to foresee delays in traffic
* etc

In the end, for our purposes, data science can help us better understand our data and provide some basic statistics as well as find out the relations inside the data.

Some DS tasks that are also useful for us:
* asking the right questions
* exploring and collecting data
* extracting data
* cleaning the data
* finding and replacing missing values
* normalizing data
* analyzing data, finding patterns and making predictions
* representing the results (visualisations)
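
Two of the tasks above, finding and replacing missing values and normalizing data, can be sketched with pandas, the library introduced later in this notebook. The temperature readings below are made up for illustration, and filling with the column mean and min-max scaling are each only one of several possible choices:

```python
import pandas as pd

# Made-up readings; None marks a missing value (pandas stores it as NaN)
readings = pd.DataFrame({"temperature": [20.0, None, 30.0, 25.0]})

# Finding missing values: count them in the column
missing_count = readings["temperature"].isna().sum()

# Replacing missing values: here we fill with the column mean (an assumption,
# other strategies such as the median or dropping the row are equally valid)
filled = readings["temperature"].fillna(readings["temperature"].mean())

# Normalizing: min-max scaling maps the values into the range 0 to 1
normalized = (filled - filled.min()) / (filled.max() - filled.min())

print(missing_count)
print(filled.tolist())
print(normalized.tolist())
```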

### Data 

Types of data:
* unstructured (Tweets, novels, reviews)
* structured (databases, tables)

One purpose of Data Science is to structure data, making it interpretable and easy to work with.



### Data analysis in Python

Python is the "main language" of data science, as it has many built-in functions as well as libraries developed specifically for data science.

Some of these libraries:
* [pandas](https://pandas.pydata.org/) - for structured data operations
* [numpy](https://numpy.org/) - a mathematical library (arrays, linear algebra, Fourier transform)
* [matplotlib](https://matplotlib.org/) - visualizations
* [SciPy](https://scipy.org/) - scientific computing (optimization, statistics, linear algebra)

### Data analysis and manipulation with Pandas

What can we find out with pandas?

* what is the correlation between two or more columns?
* what is the average value?
* what are the min and max values?
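
The questions above map directly onto pandas methods. Here is a minimal sketch, assuming a made-up table of study hours and exam scores (the `scores` DataFrame and its column names are hypothetical):

```python
import pandas as pd  # pd is the conventional alias for pandas

# Made-up data: hours studied and exam scores for five students
scores = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 58, 61, 70, 74],
})

# Correlation between two columns (Pearson by default)
correlation = scores["hours"].corr(scores["score"])

# Average, minimum and maximum of a column
average = scores["score"].mean()
lowest = scores["score"].min()
highest = scores["score"].max()

print(round(correlation, 3), average, lowest, highest)
```

For a quick overview of all numeric columns at once, `scores.describe()` reports the count, mean, min, max and quartiles in a single table.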

To install pandas, follow the instructions at https://pandas.pydata.org/docs/getting_started/install.html

In [None]:
import pandas

Let's check our pandas version:

In [None]:
pandas.__version__

Usually, pandas is imported under the abbreviation 'pd', so the import actually looks like this:

In [None]:
import pandas as pd

Data types are extremely important in data science. We are already familiar with the standard Python data types, such as strings, ints, lists and dictionaries. In pandas, the two most important data types (or objects) are called Series and DataFrame.

### Pandas Series

A pandas Series is like a column in a table. It is a one-dimensional array that can hold any type of data. We can create it out of a list, e.g.:

In [None]:
a = [1, 2, 3]
my_series = pd.Series(a)
my_series

If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second value has index 1, etc.

This label can be used to access a specified value.



In [None]:
my_series[0]

With the index argument, you can name your own labels.



In [None]:
my_series = pd.Series(a, index = ["x", "y", "z"])

my_series

When you have created labels, you can access an item by referring to the label.



In [None]:
print(my_series['x'])

We can think of a Series as a specialized dictionary. A dictionary maps keys to values, and a Series likewise maps typed keys to a set of typed values.


In [None]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}



<div class="alert alert-block alert-info">
<b>Exercise 1</b>
    
<p>
    <li>Make a Series object called population out of population_dict </li>
    <li>find out the population of California </li>
 
</p>
</div>

We can also slice the data:

In [None]:
population['Texas':'Illinois']
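
Note that slicing by label includes the end label, unlike slicing by position, where the stop position is excluded. A minimal sketch with made-up fruit prices (the `prices` Series is hypothetical and deliberately separate from the exercise data):

```python
import pandas as pd

# Made-up Series; the dictionary keys become the index labels
prices = pd.Series({"apple": 3, "banana": 1, "cherry": 7, "date": 5})

# Label-based slice: both 'banana' and 'date' are included
by_label = prices["banana":"date"]

# Position-based slice: position 3 (the stop) is excluded
by_position = prices.iloc[1:3]

print(by_label.tolist())     # [1, 7, 5]
print(by_position.tolist())  # [1, 7]
```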

### Pandas Dataframes

DataFrames are two-dimensional, table-like representations of data and, unlike Series, usually have more than one column.

Let's start by making a dataframe with some numbers:

In [None]:
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}

In [None]:
df = pd.DataFrame(data=d, index = ['number1', 'number2', 'number3', 'number4', 'number5'])
df

We see that "col1", "col2" and "col3" are the names of the columns. The index labels were passed in as a list through the index argument.
They identify the rows.

We can use ```df.shape``` to count the rows and columns. It returns a tuple of the form (rows, columns), so ```df.shape[0]``` is the number of rows:



In [None]:
count_row = df.shape[0]

In [None]:
print(count_row)
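
The column count can be read from the same tuple. A short sketch on a small made-up frame (the name `small` is only for illustration):

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns
small = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = small.shape

print(n_rows, n_cols)                  # 3 2
print(len(small), len(small.columns))  # the same numbers, obtained with len
```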

A DataFrame can also be built from several Series. To demonstrate this, let's first construct a new Series listing the area of each of the five states from the population example:



In [None]:
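area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
# Convert the dictionary into a Series so it can be used as a DataFrame column
area = pd.Series(area_dict)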


Now that we have this along with the population Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:



In [None]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

### Indexing 

The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:



In [None]:
states['area']


In [None]:
states.area

Like with the Series objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case adding a new column:



In [None]:
states['density'] = states['population'] / states['area']
states

With the index argument, you can name your own indexes.



We can also view the DataFrame as an enhanced two-dimensional array. We can examine the raw underlying data array using the values attribute:



In [None]:
states.values


With this picture in mind, many familiar array-like operations can be done on the DataFrame itself. For example, we can transpose the full DataFrame to swap rows and columns:



In [None]:
states.T

For array-style indexing, pandas provides the loc and iloc indexers. Using the iloc indexer, we can index the underlying array, while the DataFrame index and column labels are maintained in the result.
Unlike loc, which selects by label, iloc is integer-position based: it only works with the row and column positions, numbered from 0 to n-1.



In [None]:
states.iloc[:3, :2]


In [None]:
states.loc[:'Florida', :'population']


Using the loc indexer, we can index the underlying data in an array-like style but with the explicit index and column names. For example, we can select a single row by its label:



In [None]:
print(states.loc['California'])
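
The difference between the two indexers is easiest to see side by side. A minimal sketch on a made-up frame (`grid` and its labels are hypothetical):

```python
import pandas as pd

grid = pd.DataFrame(
    {"x": [10, 20, 30], "y": [1, 2, 3]},
    index=["a", "b", "c"],
)

# iloc: integer positions, the stop position is excluded (rows 0 and 1)
first_two_by_position = grid.iloc[:2]

# loc: labels, the end label is included (rows 'a' through 'b')
up_to_b_by_label = grid.loc[:"b"]

# Selecting a single cell: row first, then column
single_value = grid.loc["c", "x"]   # by labels
same_value = grid.iloc[2, 0]        # by positions

print(first_two_by_position.equals(up_to_b_by_label))  # True
print(single_value, same_value)                        # 30 30
```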


Direct masking operations are interpreted row-wise rather than column-wise: a Boolean condition placed inside the square brackets selects the rows for which it is true:



In [None]:
states[states.density > 100]
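
Conditions can also be combined. The sketch below uses made-up numbers (the `cities` frame is hypothetical) and shows that a mask is simply a Boolean Series with one value per row:

```python
import pandas as pd

cities = pd.DataFrame(
    {"population": [500, 1200, 300, 2500], "area": [10, 40, 2, 50]},
    index=["A", "B", "C", "D"],
)

# A mask is a Boolean Series: True for the rows that satisfy the condition
mask = cities["population"] > 400
large = cities[mask]

# Combine conditions with & (and) or | (or); each condition needs parentheses
large_and_compact = cities[(cities["population"] > 400) & (cities["area"] < 45)]

print(list(large.index))              # ['A', 'B', 'D']
print(list(large_and_compact.index))  # ['A', 'B']
```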


<div class="alert alert-block alert-info">
<b>Exercise 2</b>
    
<p>
    <li>Make a new dataframe out of exercise_data  </li>
    <li>Give the rows indexes as follows: 'day1', 'day2', 'day3' </li>
    <li>Return 'day3' using loc </li>
</p>
</div>

In [None]:
exercise_data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}


### Creating a dataframe from csv

This is functionality that we use a lot, for reading in our own datasets.

In [None]:
df = pd.read_csv('data/koreanTV.csv')
df

In [None]:
df.shape
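
If the CSV file is not at hand, the behaviour of ```read_csv``` can be tried out on text held in memory. The sketch below assumes a tiny made-up table; its column names and values are illustrative and not taken from koreanTV.csv:

```python
import io

import pandas as pd

# A tiny CSV written as a string; the first line holds the column names
csv_text = "Title,Year,Rating\nShow A,2020,8.1\nShow B,2021,7.4\n"

# read_csv accepts a file path or any file-like object such as StringIO
shows = pd.read_csv(io.StringIO(csv_text))

print(shows.shape)   # (2, 3)
print(shows.dtypes)  # the column types that pandas inferred
```

pandas infers a type for each column while reading, so it is worth checking ```dtypes``` after loading a new dataset.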


We can also preview the DataFrame using the head function. By default, head displays the first five rows of the DataFrame.



In [None]:
df.head()

Similarly, we can use the tail function to display the last five rows.



In [None]:
df.tail()
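
Both functions take an optional number of rows. A short sketch on a made-up frame of ten numbers (the name `numbers` is only for illustration):

```python
import pandas as pd

# Hypothetical frame with the values 0 to 9
numbers = pd.DataFrame({"n": range(10)})

first_three = numbers.head(3)  # first three rows instead of the default five
last_two = numbers.tail(2)     # last two rows

print(first_three["n"].tolist())  # [0, 1, 2]
print(last_two["n"].tolist())     # [8, 9]
```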


### Basic analysis of our data

Very often, not all the data in a pandas DataFrame is relevant to our work. Fortunately, a DataFrame allows us to selectively extract one or more columns of data and work with them.



We can use a column name as the key to extract the data from that particular column.



In [None]:
df['Title'].head()


We can also select multiple columns by putting the target column names in a list.



In [None]:
df[['Title', 'Rating']].head()


If we really do not need a column, we can always remove it. For that we use ```drop```:

In [None]:
df = df.drop("Short Story", axis=1)
df.head()

Here the drop function removes the column “Short Story” from the DataFrame, and head shows the first five rows. The “axis=1” argument specifies that we are removing a column rather than a row.

Now let's use the rows to navigate the dataframe. We will first try the loc method, which allows us to select a row from a DataFrame using an index label. For example, we can set the title as the index of the DataFrame and use it to extract our target rows of data.
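
Note that drop returns a new DataFrame and leaves the original untouched, which is why the result has to be assigned to a variable. A minimal sketch with a made-up frame (`films` and its columns are hypothetical):

```python
import pandas as pd

films = pd.DataFrame({
    "Title": ["A", "B"],
    "Rating": [8.0, 7.5],
    "Notes": ["x", "y"],
})

# Two equivalent ways to drop a column
without_notes = films.drop("Notes", axis=1)
also_without = films.drop(columns="Notes")

print(list(without_notes.columns))  # ['Title', 'Rating']
print("Notes" in films.columns)     # True: the original is unchanged
```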



In [None]:
df = df.set_index('Title')
df.head()


In [None]:
df.shape

“Title” has been removed from the columns of the DataFrame and now serves as the index.
As the column count shows, we have dropped one column and turned another into the index, so 6 columns remain.
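
The effect of set_index on the column count can be checked on a small made-up frame (`films` is hypothetical); reset_index is shown as one way to turn the index back into a regular column:

```python
import pandas as pd

films = pd.DataFrame({"Title": ["A", "B"], "Rating": [8.0, 7.5]})

# The index is not counted as a column, so the shape shrinks from (2, 2) to (2, 1)
indexed = films.set_index("Title")
print(indexed.shape)

# The index labels can now be used with loc
rating_b = indexed.loc["B", "Rating"]
print(rating_b)

# reset_index moves the index back into an ordinary column
restored = indexed.reset_index()
print(list(restored.columns))
```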

We can make use of this movie name as an index label to extract the data from the DataFrame by row.



In [None]:
df.loc['Squid Game']

In [None]:
df.loc[['Squid Game', 'Happiness']]

The second way to select rows from a pandas DataFrame is the iloc method, which selects rows by integer position rather than by label.


In [None]:
df.iloc[3]

In [None]:
df.iloc[2:5]

Let's now remove a row. As before, we can use a title as the index label to identify the row. In this case, the row with the title “My Name” will be removed from the DataFrame.


In [None]:
df = df.drop('My Name', axis=0)
df.head()

<div class="alert alert-block alert-info">
<b>Exercise 3</b>
    
<p>
    <li>Import housing.csv into a dataframe  </li>
    <li>Preview the head of the data </li>
    <li>Remove the latitude column </li>
    <li>Set a condition to select data with the housing median age above 50</li>
</p>
</div>

### References and further reading:
    
 * Python Data Science Handbook - https://jakevdp.github.io/PythonDataScienceHandbook/
 * https://www.w3schools.com/python/pandas/default.asp
 * https://www.geeksforgeeks.org/data-science-tutorial/
 * data taken from - https://github.com/teobeeguan/Python-For-Machine-Learning/tree/main/Pandas
 * tutorial -  https://medium.com/ds-notes/learning-python-pandas-in-minutes-part-1-basics-f24463da1a18
 