# Introduction
The first step in most data analytics projects is reading the data file. In this section, you'll create `Series` and `DataFrame` objects, both by hand and by reading data files.

*Note:* This notebook is heavily inspired by the Pandas course on Kaggle Learn.

# Relevant Resources
* [General Pandas Cheat Sheet](https://assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf)


Run the code cell below to load the libraries you will need.

In [0]:
import pandas as pd

In the cell below, create a DataFrame `fruits` that looks like this:

![](https://i.imgur.com/Ax3pp2A.png)

In [0]:
# Creating a dataframe matching the above diagram and assigning it to the variable named fruits.
fruits = pd.DataFrame({'Apples': [30], 'Bananas': [21]})
fruits

Create a dataframe `fruit_sales` that matches the diagram below:

![](https://i.imgur.com/CHPn7ZF.png)

In [0]:
## Create a dataframe matching the above diagram and assign it to the variable fruit_sales.
## Hint: use the index parameter to set the row labels shown in the leftmost column.
fruit_sales = 
fruit_sales
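
As a minimal sketch of how the `index` parameter behaves, the example below builds a small DataFrame with custom row labels. The variable `veg_sales` and all of its values are hypothetical and unrelated to the diagram above.

```python
import pandas as pd

# Hypothetical data, chosen only to illustrate the index parameter.
veg_sales = pd.DataFrame(
    {'Carrots': [12, 15], 'Peas': [7, 9]},
    index=['Week 1', 'Week 2'],  # row labels, displayed in the leftmost column
)
print(veg_sales)
```

Without `index`, pandas would label the rows `0` and `1` instead.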

Create a variable called `ingredients` with a `pd.Series` that looks like:

```
Flour     4 cups
Milk       1 cup
Eggs     2 large
Spam       1 can
Name: Dinner, dtype: object
```

In [0]:
ingredients = pd.Series(['4 cups','1 cup','2 large', '1 can'], index=['Flour', 'Milk', 'Eggs', 'Spam'], name='Dinner')

ingredients

### Downloading files to Colab

Use the **wget** tool. Prefix the command with `!` so that the notebook runs it as a terminal command.

In [0]:
# Download a file to Colab:
!wget https://gist.github.com/tdchaitanya/d84c787328df169c50a06eb1669666c9/raw/7ffeddc80bec1c22e91bfed6e026620cf989eacf/housing_data.csv

### Loading CSV files to Pandas

Read the downloaded CSV file of housing data into a DataFrame called `housing`:


In [0]:
housing = pd.read_csv('housing_data.csv')

housing.head()

Download and read another file called "winemag-data-130k-v2.csv" from "https://github.com/davestroud/Wine/raw/master/winemag-data-130k-v2.csv" to a variable called `reviews`.

In [0]:
## Download the data here.
## Hint: use wget.


In [0]:
reviews = 
## Check whether it looks okay. Is there any problem?
reviews.head()
## Correct the problem below:
reviews = 
reviews.head()
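
One issue that often appears when a CSV was saved together with its row index is an extra unnamed column. The sketch below assumes that situation and uses a tiny hypothetical CSV string (`csv_text`) instead of the real file. Whether this is the problem in the wine data is for you to check.

```python
import io
import pandas as pd

# Hypothetical CSV text saved together with its row index,
# so the first header field is empty.
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n"

# Default read: the saved index comes back as an ordinary data column.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# index_col=0 tells pandas to use the first column as the row index instead.
fixed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(fixed.columns.tolist())
```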

In [0]:
## Fetch the column called 'description' to a variable called desc:
desc = 

## What type of object is desc? If you're not sure, you can check by calling Python's type function: type(desc)

In [0]:
## Select the first value from the description column of `reviews`, assigning it to variable `first_description`.
first_description = 

first_description

Select the first row of data (the first record) from `reviews`, assigning it to the variable `first_row`.

In [0]:
## Select the first row of data (the first record) from `reviews`, assigning it to the variable `first_row`.
first_row = 

first_row
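
The selection patterns used in the last few cells can be sketched on a toy DataFrame. The names `toy`, `col`, `first_value`, and `first_record` and the data are hypothetical. Only the access patterns carry over to `reviews`.

```python
import pandas as pd

# Hypothetical three-row stand-in for the reviews data.
toy = pd.DataFrame({
    'country': ['Italy', 'Portugal', 'US'],
    'description': ['Aromas of fruit', 'Ripe and smooth', 'Tart and snappy'],
})

col = toy['description']    # a single column; toy.description is equivalent
print(type(col))            # a column selected this way is a pandas Series

first_value = col.iloc[0]   # first entry of the column, by position
first_record = toy.iloc[0]  # first row, returned as a Series keyed by column name
print(first_value)
print(first_record)
```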

Select the first 10 values from the `description` column in `reviews`, assigning the result to variable `first_descriptions`.

In [0]:
first_descriptions = reviews.description.iloc[:10]

first_descriptions

Select the records with index labels `1`, `2`, `3`, `5`, and `8`, assigning the result to the variable `sample_reviews`.

In other words, generate the following DataFrame:

![](https://i.imgur.com/sHZvI1O.png)

In [0]:
# The task asks for index labels, so select with loc rather than positional iloc.
sample_reviews = reviews.loc[[1, 2, 3, 5, 8]]

sample_reviews

Create a variable `df` containing the `country`, `province`, `region_1`, and `region_2` columns of the records with the index labels `0`, `1`, `10`, and `100`. In other words, generate the following `DataFrame`:

![](https://i.imgur.com/FUCGiKP.png)

In [0]:
df = reviews.loc[[0,1,10,100], ['country', 'province', 'region_1', 'region_2']]
df

Create a variable `df` containing the `country` and `variety` columns of the first 100 records. 

Hint: you may use `loc` or `iloc`. When working on this question and several of the ones that follow, keep in mind the following "gotcha" described in the [reference](https://www.kaggle.com/residentmario/indexing-selecting-assigning-reference) for this tutorial section:

> `iloc` uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So `0:10` will select entries `0,...,9`. `loc`, meanwhile, indexes inclusively. So `0:10` will select entries `0,...,10`.

> [...]

> ...[consider] when the DataFrame index is a simple numerical list, e.g. `0,...,1000`. In this case `reviews.iloc[0:1000]` will return 1000 entries, while `reviews.loc[0:1000]` returns 1001 of them! To get 1000 elements using `loc`, you will need to go one lower and ask for `reviews.loc[0:999]`.
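
The difference between the two slicing rules is easy to check on a small Series. This is a minimal sketch with hypothetical data (`s`), assuming the default integer index `0, 1, 2, ...`.

```python
import pandas as pd

# Hypothetical Series with the default integer index 0..19.
s = pd.Series(range(20))

by_position = s.iloc[0:10]  # positional slice: end excluded, entries 0..9
by_label = s.loc[0:10]      # label slice: end included, entries 0..10

print(len(by_position), len(by_label))
```

With `loc`, asking for `0:9` gives the same ten entries as `iloc[0:10]`.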

In [0]:
df = reviews.loc[0:99,['country','variety']]

In [0]:
## Create a DataFrame italian_wines containing reviews of wines made in Italy:
italian_wines = 
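
Row filtering of this kind is usually done with a boolean mask. The sketch below uses a hypothetical three-row DataFrame (`toy`) and filters on a different country, so it shows the pattern without filling in the cell above.

```python
import pandas as pd

# Hypothetical stand-in for the reviews data.
toy = pd.DataFrame({
    'country': ['Italy', 'France', 'Italy'],
    'points': [87, 90, 92],
})

# Comparing a column to a value gives a boolean Series, one entry per row.
mask = toy['country'] == 'France'

# Indexing with the mask keeps only the rows where it is True.
french = toy[mask]  # toy.loc[mask] is equivalent
print(french)
```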

## Data Visualization via matplotlib

The bar chart below shows the ten provinces with the most wine labels in the dataset:

In [0]:
reviews['province'].value_counts().head(10).plot.bar()

In [0]:
## This bar chart tells us absolute numbers, but it's more useful to know relative proportions. 
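
One way to get relative proportions is the `normalize` parameter of `value_counts`. The sketch below uses a hypothetical four-entry Series (`provinces`) in place of `reviews['province']`.

```python
import pandas as pd

# Hypothetical stand-in for the province column.
provinces = pd.Series(['California', 'California', 'Tuscany', 'Bordeaux'])

counts = provinces.value_counts()                # absolute counts per province
shares = provinces.value_counts(normalize=True)  # fraction of all rows per province

print(counts)
print(shares)
```

Dividing the counts by `len(provinces)` gives the same fractions, and either result can be passed to `.plot.bar()` in the same way as the counts.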


The number of reviews for each score allotted by Wine Magazine:

In [0]:
reviews['points'].value_counts().sort_index().plot.bar()

In [0]:
## Plot the number of reviews for each score allotted by Wine Magazine as a line chart, using plot.line():


### Relationship between **price** and **points**.

The simplest bivariate plot is the **scatter plot**. A scatter plot maps each pair of values from the two variables of interest to a point in two-dimensional space. The cell below plots `points` against `price` for a random sample of 100 wines with a price below 100:

In [0]:
reviews[reviews['price'] < 100].sample(100).plot.scatter(x='price', y='points')