
# Splitting a dataframe into two subsets

### Suppose we want to split a dataframe into two parts: one subset gets 80% of the rows and the other gets the remaining 20%.

In [1]:
import pandas as pd

startup = pd.read_csv("startup_funding.csv")

In [2]:
len(startup)

3009

We'll split these 3009 rows into two random, non-overlapping subsets.

In [3]:
startup_1 = startup.sample(frac=0.80)

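To get the other subset, we use the drop() method to remove every row whose index label appears in 'startup_1'. This relies on the index of 'startup' being unique, which is the case for the default integer index created by read_csv().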

In [4]:
startup_2 = startup.drop(startup_1.index)

<a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html">**sample() documentation**</a>

In [5]:
len(startup_1)

2407

In [6]:
len(startup_2)

602

In [7]:
len(startup_1) + len(startup_2)

3009
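The split above changes on every run because sample() draws a fresh random sample each time. Below is a minimal sketch of a reproducible version, assuming a small made-up dataframe (`df`, `part_1` and `part_2` are illustrative names, not part of the startup dataset) and using the `random_state` parameter of sample() to fix the seed.

```python
import pandas as pd

# Hypothetical stand-in for the startup dataframe: 10 rows with a unique index.
df = pd.DataFrame({"value": range(10)})

# random_state fixes the seed, so the same rows are sampled on every run.
part_1 = df.sample(frac=0.80, random_state=42)

# drop() removes rows by index label, so this relies on the index being unique.
part_2 = df.drop(part_1.index)

print(len(part_1), len(part_2))
# An empty intersection means no row ended up in both subsets.
print(part_1.index.intersection(part_2.index).empty)
```

Any integer works as the seed; what matters is that the same value is used whenever the same split is needed again.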

# String Methods in pandas

In [9]:
# String method for converting the values of a series to uppercase

startup.Location.str.upper().head()

0          PUNE
1        MUMBAI
2        MUMBAI
3     HYDERABAD
4    BURNSVILLE
Name: Location, dtype: object

In [13]:
# The string method 'contains' checks whether each value contains a given substring and returns the result as a boolean series

startup.Location.str.contains('umb').head()

0    False
1     True
2     True
3    False
4    False
Name: Location, dtype: object
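When a column has missing values, contains() has no True/False answer for those rows unless it is told what to use. Here is a small sketch, assuming a made-up `locations` series with one missing entry, that uses the `na` parameter of contains() so the result is a clean boolean series.

```python
import pandas as pd

# Hypothetical series with one missing location.
locations = pd.Series(["Pune", "Mumbai", None], dtype=object)

# na=False treats a missing value as "does not contain the substring".
mask = locations.str.contains("umb", na=False)

print(mask.tolist())  # [False, True, False]
```

Whether a missing value should count as False (or True) depends on the question being asked, so the choice here is only one option.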

In [16]:
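# str.contains returns a missing value for missing entries, so fill them before building a boolean mask;
# assigning the result back avoids calling fillna(inplace=True) on a column selection
startup['Industry_Vertical'] = startup['Industry_Vertical'].fillna('Unknown')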

In [17]:
# using the boolean series for filtering the dataframe

startup[startup.Industry_Vertical.str.contains('tech')].head()

| | Sr No | Date | Startup_Name | Industry_Vertical | Sub_Vertical | Location | Investors | Investment_Type | Amount_in_USD |
|---|---|---|---|---|---|---|---|---|---|
| 60 | 61 | 28-05-2019 | FreshVnF | Agtech | Fresh Agriculture Produces | Mumbai | Equanimity Ventures | Seed Round | 140000000 |
| 68 | 69 | 11-04-2019 | Setu | Fintech | Banking | Bengaluru | Lightspeed India Partners | Seed Funding | 3500000 |
| 69 | 70 | 10-04-2019 | Toppr | Edtech | E-learning | Mumbai | Milestone | Debt and Preference capital | 6320820 |
| 71 | 72 | 10-04-2019 | Unacademy | Edtech | E-learning | Bengaluru | Kalyan Krishnamurthy | Seed/ Angel Funding | 307000 |
| 79 | 80 | 13-02-2019 | NanoClean Global | Nanotechnology | Anti-Pollution | New Delhi | LetsVenture, PitchRight Venture, 91SpringBoard... | Series A | 600000 |
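Note that contains() is case-sensitive by default, so a filter on 'tech' matches 'Fintech' and 'Nanotechnology' but not a value such as 'Technology'. The sketch below illustrates this on a small made-up dataframe (the 'Example A' row is hypothetical) and shows the `case=False` option as one way to widen the match.

```python
import pandas as pd

# Small illustrative dataframe; 'Example A' is a made-up row.
df = pd.DataFrame({
    "Startup_Name": ["FreshVnF", "Setu", "NanoClean Global", "Example A"],
    "Industry_Vertical": ["Agtech", "Fintech", "Nanotechnology", "Technology"],
})

# Default, case-sensitive match: 'Technology' is not selected.
strict = df[df["Industry_Vertical"].str.contains("tech")]

# case=False ignores letter case, so 'Technology' is selected as well.
relaxed = df[df["Industry_Vertical"].str.contains("tech", case=False)]

print(strict["Startup_Name"].tolist())
print(relaxed["Startup_Name"].tolist())
```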


# Applying a function to pandas series/dataframe

### Using the map() function to map the existing values of a series to a different set of values

In [22]:
# mapping 'Series A' funding to 0 and 'Series B' funding to 1; values that are not in the dictionary become NaN

startup['Fund_Map'] = startup.Investment_Type.map({'Series A':0,'Series B':1})

startup.loc[20:30, ['Startup_Name','Fund_Map']]

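| | Startup_Name | Fund_Map |
|---|---|---|
| 20 | INDwealth | NaN |
| 21 | HungerBox | NaN |
| 22 | AdmitKard | NaN |
| 23 | Mishry Reviews | 0.0 |
| 24 | Grofers | NaN |
| 25 | Rapido Bike Taxi | 1.0 |
| 26 | RenewBuy | 1.0 |
| 27 | Atlan | NaN |
| 28 | WizCounsel | NaN |
| 29 | Ola Cabs | NaN |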
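The missing entries in 'Fund_Map' come from investment types that are not keys of the dictionary, and they also explain why the mapped values show up as floats (0.0, 1.0) rather than integers. A minimal sketch on a made-up series follows; the -1 code for "other" is purely an assumption for illustration.

```python
import pandas as pd

# Hypothetical sample of investment types.
investment_type = pd.Series(["Seed Funding", "Series A", "Series B", "Series B"])

# Values that are not dictionary keys are mapped to NaN, which forces a float dtype.
codes = investment_type.map({"Series A": 0, "Series B": 1})
print(codes.tolist())

# One possible follow-up (an assumption, not part of the original notebook):
# give unmapped values an explicit code and convert back to integers.
codes_int = codes.fillna(-1).astype(int)
print(codes_int.tolist())  # [-1, 0, 1, 1]
```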


<a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html">**map() documentation**</a>

In [29]:
# Loading another dataset

drinksco = pd.read_csv("drinks.csv")

In [30]:
# applying the max function along axis=1 to obtain the maximum value in each row

drinksco.loc[:,'beer_servings':'wine_servings'].apply(max,axis=1).head()

0      0
1    132
2     25
3    312
4    217
dtype: int64
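For common reductions such as the row maximum, pandas also offers a built-in method that gives the same result without going through apply(). The sketch below compares the two on a small dataframe built from the first five rows of the servings columns as displayed in this notebook; `servings` is an illustrative name.

```python
import pandas as pd

# First five rows of the three servings columns from drinks.csv.
servings = pd.DataFrame({
    "beer_servings": [0, 89, 25, 245, 217],
    "spirit_servings": [0, 132, 0, 138, 57],
    "wine_servings": [0, 54, 14, 312, 45],
})

row_max_apply = servings.apply(max, axis=1)  # applies max to each row
row_max_builtin = servings.max(axis=1)       # built-in row-wise maximum

print(row_max_builtin.tolist())  # [0, 132, 25, 312, 217]
```

apply() remains the more general tool, since it accepts any function that takes a row (or column) and returns a value.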

<a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html">**apply() documentation**</a>

### Applying a function to every element in a dataframe

In [32]:
# converting every element into a float (applymap is deprecated since pandas 2.1 in favour of DataFrame.map)
drinksco.loc[:,'beer_servings':'wine_servings'].applymap(float).head()

| | beer_servings | spirit_servings | wine_servings |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 |
| 1 | 89.0 | 132.0 | 54.0 |
| 2 | 25.0 | 0.0 | 14.0 |
| 3 | 245.0 | 138.0 | 312.0 |
| 4 | 217.0 | 57.0 | 45.0 |
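For a plain type conversion like this one, astype() is a common alternative to an element-wise function. The sketch below, using the same five rows in an illustrative `servings` dataframe, compares astype(float) with the element-wise approach; it picks DataFrame.map when the installed pandas provides it and falls back to applymap() otherwise.

```python
import pandas as pd

# First five rows of the three servings columns from drinks.csv.
servings = pd.DataFrame({
    "beer_servings": [0, 89, 25, 245, 217],
    "spirit_servings": [0, 132, 0, 138, 57],
    "wine_servings": [0, 54, 14, 312, 45],
})

# Column-wise conversion of the whole dataframe in one call.
as_float = servings.astype(float)

# Element-wise conversion: DataFrame.map exists from pandas 2.1 onwards,
# older versions only have applymap.
if hasattr(servings, "map"):
    elementwise = servings.map(float)
else:
    elementwise = servings.applymap(float)

print(as_float.equals(elementwise))
```

The element-wise route is still the one to use when the function is not a simple cast, for example a custom formatting or cleaning function.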


<a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.applymap.html">**applymap() documentation**</a>