## What is data science?

In [1]:
from IPython.display import HTML, Image

The term 'data science' has become a buzzword in recent years. Data scientists and data analysts are highly sought after, and the field has become a popular branch of IT.

But what exactly is data science, and what skills are necessary to become someone who analyzes data?



Data science is an interdisciplinary field, which means it sits at the intersection of multiple disciplines. At its centre is data, often described as the most valuable commodity of today.

Data science seeks to analyze and process data in such a way that new knowledge can be extracted from it. This knowledge serves to improve businesses and advance science by supporting the decision-making process (what should we do?), making predictions (what will happen next?) and discovering patterns (finding hidden information in the data).

In [2]:
Image(url= "img/DS.png", width=400, height=1200)

source: https://jakevdp.github.io/PythonDataScienceHandbook/00.00-preface.html

This diagram shows where data science lies at the intersection of different disciplines. It combines the skills of statistics (knowing how to summarize datasets), computer science (knowing how to design algorithms) and domain expertise (necessary to formulate the research questions and put the answers into context).

### Where is Data Science needed?

It's used in many disciplines, such as science, banking, consulting, health care, e-commerce, politics, and manufacturing.


Some areas where data science is used:
* marketing (sales analysis)
* predicting election results
* predicting the best time for deliveries
* foreseeing traffic delays

For our purposes, data science can help us better understand our data, compute some basic statistics, and uncover relationships within the data.

Some data science tasks that are also useful for us:
* asking the right questions
* exploring and collecting data
* extracting data
* cleaning the data
* finding and replacing missing values
* normalizing data
* analyzing data, finding patterns and making predictions
* representing the results (visualisations)
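
A couple of these steps can be sketched with pandas, the library introduced later in this notebook. The small `temps` table and its column names are made up for illustration, and filling gaps with the column mean and min-max scaling are each only one of several possible choices:

```python
import pandas as pd

# Hypothetical measurements with one missing value (None becomes NaN)
temps = pd.DataFrame({"city": ["A", "B", "C", "D"],
                      "temp_c": [10.0, None, 20.0, 30.0]})

# Finding and replacing missing values: fill the gap with the column mean
temps["temp_c"] = temps["temp_c"].fillna(temps["temp_c"].mean())

# Normalizing: rescale the column to the range 0..1 (min-max scaling)
col = temps["temp_c"]
temps["temp_scaled"] = (col - col.min()) / (col.max() - col.min())

print(temps)
```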

### Data 

Types of data:
* unstructured (Tweets, novels, reviews)
* structured (databases, tables)

One purpose of Data Science is to structure data, making it interpretable and easy to work with.



### Data analysis in Python

Python is often considered the "main language" of data science, as it has many built-in functions as well as libraries developed for the purposes of data science.

Some of these libraries:
* [pandas](https://pandas.pydata.org/) - for structured data operations
* [numpy](https://numpy.org/) - a mathematical library (linear algebra, Fourier transform)
* [matplotlib](https://matplotlib.org/) - visualizations
* [SciPy](https://scipy.org/) - scientific computing (optimization, integration, statistics, linear algebra)

### Data analysis and manipulation with Pandas

What can we find out with pandas?

* what is the correlation between two or more columns?
* what is the average value?
* what are the min and max values?

To install pandas - https://pandas.pydata.org/docs/getting_started/install.html

In [5]:
import pandas

Let's check our pandas version:

In [6]:
pandas.__version__

'1.4.4'

By convention, pandas is imported under the abbreviation 'pd', so the import usually looks like this:

In [26]:
import pandas as pd

Data types are extremely important in data science. We are already familiar with the standard Python data types, such as strings, ints, lists, and dictionaries. In pandas, the two most important data types (or objects) are called Series and DataFrame.

### Pandas Series

A pandas Series is like a column in a table. It is a one-dimensional array that can hold any type of data. We can create one out of a list, e.g.:

In [10]:
a = [1, 2, 3]
my_series = pd.Series(a)
my_series

0    1
1    2
2    3
dtype: int64

If nothing else is specified, the values are labeled with their index number: the first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.



In [11]:
my_series[0]

1

With the index argument, you can name your own labels.



In [13]:
my_series = pd.Series(a, index = ["x", "y", "z"])

my_series

x    1
y    2
z    3
dtype: int64

When you have created labels, you can access an item by referring to the label.



In [16]:
print(my_series['x'])

1
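
Once a Series has custom labels, it can become ambiguous whether a number in square brackets means a label or a position. A minimal sketch of the explicit accessors `.loc` (by label) and `.iloc` (by position), using the same values and labels as `my_series` above:

```python
import pandas as pd

my_series = pd.Series([1, 2, 3], index=["x", "y", "z"])

by_label = my_series.loc["y"]      # look up the value labeled "y"
by_position = my_series.iloc[0]    # look up the first value, whatever its label

print(by_label, by_position)       # 2 1
```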


We can think of a Series as a specialized dictionary. Just as a dictionary maps keys to values, a Series maps typed keys to a set of typed values.


In [19]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}



<div class="alert alert-block alert-info">
<b>Exercise 1</b>
    
<p>
    <li>Make a Series object out of population_dict </li>
    <li>Find out the population of California </li>
 
</p>
</div>

We can also slice the data. Here, `population` is the Series created from `population_dict` in Exercise 1:

In [21]:
population['Texas':'Illinois']

Texas       26448193
New York    19651127
Florida     19552860
Illinois    12882135
dtype: int64
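
Note that slicing by label includes the end label, whereas slicing by position excludes the end position, just as with Python lists. The sketch below assumes that Exercise 1 was solved with `pd.Series(population_dict)` stored in a variable named `population`:

```python
import pandas as pd

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)  # assumed solution to Exercise 1

# Label-based slice: the end label 'Florida' is included
label_slice = population['Texas':'Florida']

# Position-based slice: position 3 is excluded, so only positions 1 and 2
position_slice = population.iloc[1:3]

print(label_slice)
print(position_slice)
```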

### Pandas Dataframes

DataFrames are structured, table-like representations of data and, unlike Series, usually have more than one column.

Let's start by making a DataFrame with some numbers:

In [23]:
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
d

{'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}

In [29]:
df = pd.DataFrame(data=d)
df

   col1  col2  col3
0     1     4     7
1     2     5     8
2     3     6    12
3     4     9     1
4     7     5    11


We see that "col1", "col2" and "col3" are the names of the columns. Do not be confused by the numbers 0 to 4 running down the left side. They are the index and tell us the position of each row.
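
The questions raised earlier (average, minimum, maximum, correlation) map onto DataFrame methods. A short sketch using the same numbers as `df` above:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 7],
                   'col2': [4, 5, 6, 9, 5],
                   'col3': [7, 8, 12, 1, 11]})

print(df.mean())   # average of each column
print(df.min())    # smallest value in each column
print(df.max())    # largest value in each column
print(df.corr())   # pairwise (Pearson) correlation between all columns

# Correlation between two specific columns
print(df['col1'].corr(df['col2']))
```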

Now we can count the rows and columns with ```df.shape```, which returns a tuple of the form (rows, columns). Its first element is the number of rows:



In [30]:
count_row = df.shape[0]

In [31]:
print(count_row)

5
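
Because `df.shape` is a tuple of the form (rows, columns), both counts can be read at once. A small sketch with the same data as `df` above:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 7],
                   'col2': [4, 5, 6, 9, 5],
                   'col3': [7, 8, 12, 1, 11]})

# Unpack the (rows, columns) tuple into two variables
count_row, count_col = df.shape
print(count_row, count_col)   # 5 3

print(len(df))                # number of rows
print(list(df.columns))       # column names
```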


Let's make a dataframe out of our previous Series population example.

In [32]:
df_population = pd.DataFrame(population, columns=['population'])
df_population

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
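
An alternative sketch, not used in this notebook, is `Series.to_frame`, which names the resulting column directly. It again assumes that `population` is the Series built from `population_dict` in Exercise 1:

```python
import pandas as pd

# Assumed solution to Exercise 1
population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Convert the Series into a one-column DataFrame; the labels become the index
df_population = population.to_frame(name='population')
print(df_population)
```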


Now, let's use our rows to 