# Pandas

In [None]:
# Main pandas data structures
# 1. Series
# 2. DataFrame

In [None]:
"""
Series - a 1-D array-like object that can hold any data type. It is similar to a column in a table.
DataFrame - a 2-D, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
"""

In [1]:
#Create a series from dictionary
import pandas as pd
data = {'a':1,'b':2,'c':3}
series_dict = pd.Series(data)
print(series_dict)

a    1
b    2
c    3
dtype: int64


In [2]:
#DataFrame
import pandas as pd
pd.DataFrame({"yes":[50,21],'No':[131,2]})


   yes   No
0   50  131
1   21    2


In [3]:
pd.DataFrame({"number":[20,32,23],"name":["python","Java","ruby"]})



   number    name
0      20  python
1      32    Java
2      23    ruby


In [None]:
pd.DataFrame({'Bob':['i like it','it was awful'],
              'Sue':['Pretty good.','Bland']})
            


In [None]:
pd.DataFrame({'Bob':['i like it','it was awful'],
              'Sue':['Pretty good.','Bland']},
             index = ["Product A","Product B"])

In [None]:
# Series
# A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. In fact, you can create one with nothing more than a list:
pd.Series([1,2,3,4,5])
pd.Series([9,8,7,6,5,4])

In [None]:
pd.Series([30,35,40],index=['maths','phy','chem'], name = "Subjects")

In [None]:
# A Series is, in essence, a single column of a DataFrame, so you can assign row labels to it the same way as before, using the index parameter. However, a Series does not have column names; it only has one overall name.
pd.Series([30,35,40],index=['2015 Sales','2016 Sales','2017 Sales'],name = "Product A")
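
To make the "single column of a DataFrame" idea concrete, here is a minimal sketch. The `sales` table and its numbers are made up for illustration; the point is only that selecting one column hands back a Series whose name is the column label and whose index is the DataFrame's row labels.

```python
import pandas as pd

# Hypothetical sales table; the values are invented for this example.
sales = pd.DataFrame(
    {"Product A": [30, 35, 40], "Product B": [10, 15, 20]},
    index=["2015 Sales", "2016 Sales", "2017 Sales"],
)

# Selecting a single column returns a Series, not a DataFrame.
product_a = sales["Product A"]

print(type(product_a).__name__)  # the class name of the selected column
print(product_a.name)            # the Series name is the column label
print(list(product_a.index))     # the row labels carry over unchanged
```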

In [None]:
# In the cell below, create a DataFrame fruits that looks like this:
fruits = pd.DataFrame({"Apple":[40,50,60],"Bananas":[50,232,223]})
fruits

In [None]:
fruits = pd.DataFrame([[30,21]], columns=["Apples","Bananas"])
fruits


In [None]:
fruits = pd.DataFrame([[35,21],[41,34]],columns=["Apple","Banana"],
                      index = ["Item 1","Item 2"])
fruits

In [None]:
fruit_sales = pd.DataFrame([[35, 21], [41, 34]], columns=['Apples', 'Bananas'],
                index=['2017 Sales', '2018 Sales'])
fruit_sales

In [None]:
import pandas as pd
ingredients = pd.Series(
    ["4 cups", "1 cup", "2 large", "1 can"],
    index=["Flour", "Milk", "Eggs", "Spam"],
    name="Dinner" )
ingredients

In [None]:
m = pd.Series(
    [1,2,3,45,66,89],
    index=['a','b','c','d','e','f'],
    name = "Values")
m


In [None]:
# Read the following csv dataset of wine reviews into a DataFrame called reviews:
# reviews = pd.read_csv('csvfile', index_col=0)
# reviews

In [None]:
#Run the cell below to create and display a DataFrame called animals
animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]}, index=['Year 1', 'Year 2'])
print(animals)
#In the cell below, write code to save this DataFrame to disk as a csv file with the name cows_and_goats.csv.
animals.to_csv("cows_and_goats.csv")
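
As a rough check that writing and reading are mirror images, the sketch below round-trips the same `animals` data. It writes to an in-memory `io.StringIO` buffer rather than to `cows_and_goats.csv`, purely so the example leaves no file behind; that substitution is an assumption of this sketch, not part of the exercise.

```python
import io
import pandas as pd

animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]}, index=['Year 1', 'Year 2'])

# Write the CSV text into an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
animals.to_csv(buffer)
buffer.seek(0)  # rewind so read_csv starts from the beginning

# index_col=0 tells read_csv to use the first column as the row labels,
# which is where to_csv stored the index.
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Without `index_col=0`, the saved row labels would come back as an ordinary column and a fresh 0, 1, ... index would be generated.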


In [None]:
"""In Python, we can access the property of an object by accessing it as an attribute. A book object, for example, might have a title property, which we can access by calling book.title. Columns in a pandas DataFrame work in much the same way.
Hence, to access the country property of reviews, we can use reviews.country. """
# reviews.country

"""If we have a Python dictionary, we can access its values using the indexing ([]) operator. We can do the same with columns in a DataFrame """
# reviews['country']
# reviews['country'][0]
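
Since the wine reviews file is not loaded here, the following sketch uses a small made-up DataFrame called `toy` as a stand-in. It shows that attribute access and the indexing operator return the same column, and that the indexing operator also handles a column name containing a space (the `taster name` column is hypothetical).

```python
import pandas as pd

# Made-up stand-in for the reviews data.
toy = pd.DataFrame({"country": ["Italy", "Portugal", "US"], "points": [87, 87, 90]})

by_attribute = toy.country      # attribute-style access
by_brackets = toy["country"]    # dictionary-style access
print(by_attribute.equals(by_brackets))

# Chain a second indexing operator to drill down to a single value.
first_country = toy["country"][0]
print(first_country)

# A column name with a space cannot be written as an attribute,
# so the indexing operator is the only option here.
toy["taster name"] = ["A", "B", "C"]
print(toy["taster name"])
```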


# Indexing in pandas
The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem. As a novice, this makes them easy to pick up and use. However, pandas has its own accessor operators, loc and iloc. For more advanced operations, these are the ones you're supposed to be using.

# Index-based selection
Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. iloc follows this paradigm.




In [None]:
#To select the first row of data in a DataFrame, we may use the following:
#reviews.iloc[0] #first row of data


Both loc and iloc are row-first, column-second. This is the opposite of the plain indexing operator used earlier (reviews['country'][0]), which is column-first, row-second.

This means that it's marginally easier to retrieve rows, and marginally harder to retrieve columns. To get a column with iloc, we can do the following:


In [None]:
#columns
#reviews.iloc[:,0]
'''For example, to select the country column from just the first, second, and third rows, we would do:'''
# reviews.iloc[:3,0]
#reviews.iloc[1:3, 0]
"""It's also possible to pass a list:"""
# reviews.iloc[[0, 1, 2], 0]
'''the last five elements of the dataset.'''
#reviews.iloc[-5:]
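
The commented-out calls above can be tried on any DataFrame. Here is a minimal runnable sketch on a made-up six-row frame named `toy` (an assumption standing in for `reviews`), covering each iloc pattern in turn.

```python
import pandas as pd

# Made-up stand-in for the reviews data.
toy = pd.DataFrame(
    {"country": ["Italy", "Portugal", "US", "US", "France", "Spain"],
     "points": [87, 87, 90, 85, 92, 88]}
)

first_row = toy.iloc[0]          # first row, returned as a Series
first_column = toy.iloc[:, 0]    # every row of the first column
first_three = toy.iloc[:3, 0]    # rows at positions 0, 1, 2 of the first column
picked = toy.iloc[[0, 1, 2], 0]  # same rows, requested with a list
last_two = toy.iloc[-2:]         # negative positions count from the end

print(first_row)
print(first_three.equals(picked))
print(last_two)
```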

In [None]:
# Select the records with index labels 1, 2, 3, 5, and 8, assigning the result to the variable sample_reviews.

# In other words, generate the following DataFrame:
sample_reviews = reviews.loc[[1,2,3,5,8]]
sample_reviews

# Label-based selection 
The second paradigm for attribute selection is the one followed by the loc operator: label-based selection. In this paradigm, it's the data index value, not its position, which matters.



In [None]:
# For example, to get the first entry in reviews, we would now do the following:
# reviews.loc[0,'country']

iloc is conceptually simpler than loc because it ignores the dataset's indices. When we use iloc we treat the dataset like a big matrix (a list of lists), one that we have to index into by position. loc, by contrast, uses the information in the indices to do its work. Since your dataset usually has meaningful indices, it's usually easier to do things using loc instead.
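
The difference only becomes visible when labels and positions disagree. In this sketch the made-up `toy` frame is given the index labels 10, 20, 30 (an assumption chosen to make the point), so label 10 is the row at position 0.

```python
import pandas as pd

# Made-up data with index labels that do not match the row positions.
toy = pd.DataFrame(
    {"country": ["Italy", "Portugal", "US"], "points": [87, 87, 90]},
    index=[10, 20, 30],
)

by_label = toy.loc[10, "country"]  # row labeled 10, column labeled "country"
by_position = toy.iloc[0, 0]       # first row, first column, labels ignored
print(by_label, by_position)

# There is no row labeled 0, so asking loc for it raises a KeyError.
try:
    toy.loc[0, "country"]
    missing_label_failed = False
except KeyError:
    missing_label_failed = True
print(missing_label_failed)
```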

In [None]:
#For example, here's one operation that's much easier using loc:
# reviews.loc[:,['taster_name', 'taster_twitter_handle', 'points']]

# Choosing between loc and iloc
When choosing or transitioning between loc and iloc, there is one "gotcha" worth keeping in mind, which is that the two methods use slightly different indexing schemes.

iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So 0:10 will select entries 0,...,9. loc, meanwhile, indexes inclusively. So 0:10 will select entries 0,...,10.

Why the change? Remember that loc can index any stdlib type: strings, for example. If we have a DataFrame with index values Apples, ..., Potatoes, ..., and we want to select "all the alphabetical fruit choices between Apples and Potatoes", then it's a lot more convenient to index df.loc['Apples':'Potatoes'] than it is to index something like df.loc['Apples':'Potatoet'] (t coming after s in the alphabet).

This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] returns 1001 of them! To get 1000 elements using loc, you will need to go one lower and ask for df.loc[0:999].

Otherwise, the semantics of using loc are the same as those for iloc.
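
The counts quoted above are easy to verify. The sketch below builds a throwaway frame with index labels 0 through 1000 and a second one with string labels; both frames and their values are invented for illustration.

```python
import pandas as pd

# Index labels 0, 1, ..., 1000 (1001 rows in total).
df = pd.DataFrame({"value": range(1001)})

n_iloc = len(df.iloc[0:1000])      # end position excluded
n_loc = len(df.loc[0:1000])        # end label included
n_loc_trimmed = len(df.loc[0:999])
print(n_iloc, n_loc, n_loc_trimmed)

# With string labels, the inclusive end is what makes the slice convenient.
produce = pd.DataFrame(
    {"stock": [5, 3, 8, 2]},
    index=["Apples", "Bananas", "Potatoes", "Tomatoes"],
)
selected = list(produce.loc["Apples":"Potatoes"].index)
print(selected)
```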

# Manipulating the index
Label-based selection derives its power from the labels in the index. Critically, the index we use is not immutable. We can manipulate the index in any way we see fit.

The set_index() method can be used to do the job. Here is what happens when we set_index to the title field:

In [None]:
#reviews.set_index("title")
# This is useful if you can come up with an index for the dataset which is better than the current one.
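
A small sketch of what set_index does, using a made-up `toy` frame with invented wine titles in place of `reviews`. Note that set_index returns a new DataFrame by default and leaves the original untouched.

```python
import pandas as pd

# Made-up stand-in for the reviews data; the titles are placeholders.
toy = pd.DataFrame({"title": ["Wine A", "Wine B", "Wine C"], "points": [87, 90, 92]})

# Promote the title column to be the row index.
by_title = toy.set_index("title")

# Rows can now be looked up by title with loc.
points_b = by_title.loc["Wine B", "points"]
print(points_b)

# The original frame still has title as an ordinary column.
print("title" in toy.columns, "title" in by_title.columns)
```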


# Conditional selection
So far we've been indexing various strides of data, using structural properties of the DataFrame itself. To do interesting things with the data, however, we often need to ask questions based on conditions.



In [None]:
#For example, suppose that we're interested specifically in better-than-average wines produced in Italy.
# we can start by checking if each wine is Italian or not:
reviews.country == "Italy"
# This operation produced a Series of True/False booleans based on the country of each record. This result can then be used inside of loc to select the relevant data:
reviews.loc[reviews.country == "Italy"]
'''This DataFrame has ~20,000 rows. The original had ~130,000. That means that around 15% of wines originate from Italy.

We also wanted to know which ones are better than average. Wines are reviewed on an 80-to-100 point scale, so this could mean wines that accrued at least 90 points.

We can use the ampersand (&) to bring the two questions together'''
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]

# Suppose we'll buy any wine that's made in Italy or which is rated above average. For this we use a pipe (|):
reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]
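
The same pattern on a four-row made-up frame (`toy`, an assumption standing in for `reviews`) makes the results small enough to check by eye. The parentheses around each comparison matter: `&` and `|` bind more tightly than `==` and `>=` in Python, so leaving them out changes what gets compared.

```python
import pandas as pd

# Made-up stand-in for the reviews data.
toy = pd.DataFrame(
    {"country": ["Italy", "Italy", "France", "US"],
     "points": [92, 85, 91, 84]}
)

# A boolean Series: True where the wine is Italian.
is_italian = toy.country == "Italy"
print(list(is_italian))

# Both conditions must hold.
good_italian = toy.loc[is_italian & (toy.points >= 90)]

# Either condition is enough.
italian_or_good = toy.loc[is_italian | (toy.points >= 90)]

print(good_italian)
print(italian_or_good)
```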



In [None]:
'''Pandas comes with a few built-in conditional selectors, two of which we will highlight here.

The first is isin. isin lets you select data whose value "is in" a list of values. For example, here's how we can use it to select wines only from Italy or France:'''

reviews.loc[reviews.country.isin(['Italy','France'])]


In [None]:
# The second is isnull (and its companion notnull). These methods let you highlight values which are (or are not) empty (NaN). For example, to filter out wines lacking a price tag in the dataset, here's what we would do:
reviews.loc[reviews.price.notnull()]
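
Both selectors can be tried without the wine dataset. This sketch uses a made-up `toy` frame with one missing price (the data is an assumption for illustration) and shows isin, notnull, and isnull side by side.

```python
import pandas as pd

# Made-up stand-in for the reviews data; one price is missing (NaN).
toy = pd.DataFrame(
    {"country": ["Italy", "France", "US", "France"],
     "price": [15.0, float("nan"), 14.0, 30.0]}
)

# Keep rows whose country appears in the given list.
italy_or_france = toy.loc[toy.country.isin(["Italy", "France"])]

# Keep rows that have a price, and separately the rows that lack one.
priced = toy.loc[toy.price.notnull()]
unpriced = toy.loc[toy.price.isnull()]

print(italy_or_france)
print(priced)
print(unpriced)
```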

In [None]:
# Assigning data
# Going the other way, assigning data to a DataFrame is easy. You can assign either a constant value:

reviews['critic'] = 'everyone'
reviews['critic']

#Or with an iterable of values:

reviews['index_backwards'] = range(len(reviews), 0, -1)
reviews['index_backwards']
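
Here is the same pair of assignments on a three-row made-up frame (`toy`, standing in for `reviews`), so the resulting columns can be read at a glance.

```python
import pandas as pd

# Made-up stand-in for the reviews data.
toy = pd.DataFrame({"country": ["Italy", "France", "US"]})

# A constant is broadcast to every row.
toy["critic"] = "everyone"

# An iterable supplies one value per row, in order.
toy["index_backwards"] = range(len(toy), 0, -1)

print(toy)
```

When assigning an iterable, its length has to match the number of rows; otherwise pandas raises a ValueError.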

