## Name: Shraddha Pandit Pawar

![donationimg.jpg](attachment:donationimg.jpg)

### Importing libraries and data

In [12]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


In [13]:
gfm = pd.read_csv('gfm_data.csv')
print(gfm.head())

  Category                                             Title         Location  \
0  Medical                           Justice for Jacob Blake      Kenosha, WI   
1  Medical       Official Navajo Nation COVID-19 Relief Fund  Window Rock, AZ   
2  Medical  Help a front line nurse and baby get proper care     Randolph, NJ   
3  Medical                Rest up, Tommy, we'll see you soon   Scottsdale, AZ   
4  Medical               OFFICIAL BRANDON SAENZ MEDICAL FUND        Tyler, TX   

   Amount_Raised       Goal  Days_of_Fundraising Number_of_Donors FB_Shares  \
0      2297930.0  3000000.0                   93            72.5K      118K   
1      1862040.0  1000000.0                  205            21.9K     71.7K   
2       954793.0  1200000.0                  215            18.3K     16.4K   
3       673179.0  1000000.0                  131            10.3K     21.3K   
4       570529.0   750000.0                  175            24.3K      5.5K   

                                      

In [14]:
gfm['Text']=gfm['Text'].str.lower()
print(gfm)

    Category                                             Title  \
0    Medical                           Justice for Jacob Blake   
1    Medical       Official Navajo Nation COVID-19 Relief Fund   
2    Medical  Help a front line nurse and baby get proper care   
3    Medical                Rest up, Tommy, we'll see you soon   
4    Medical               OFFICIAL BRANDON SAENZ MEDICAL FUND   
..       ...                                               ...   
833   Wishes        Juneteenth Women of Color Scholarship Fund   
834   Wishes        All-Terrain Wheelchair for Benjamin Wimett   
835   Wishes                                 Martha's Daughter   
836   Wishes                            Financial Difficulties   
837   Wishes                            Our Roots - Yahchouche   

             Location  Amount_Raised       Goal  Days_of_Fundraising  \
0         Kenosha, WI      2297930.0  3000000.0                   93   
1     Window Rock, AZ      1862040.0  1000000.0                

### Data Description

1. Category: Describes the type of fundraiser. There are 14 different categories of fundraiser represented in our dataset: 'Animals', 'Business', 'Community', 'Competition', 'Creative', 'Emergency', 'Event', 'Faith', 'Family', 'Medical', 'Memorial', 'Newlywed', 'Sports', and 'Wishes'.
2. Title: The title of the fundraiser, which describes the fundraiser in a few words.
3. Location: The location where the fundraiser takes place, stored as "city, state", e.g. "San Diego, CA".
4. Amount_Raised: The total amount of money donated, measured in dollars.
5. Goal: The total amount of money that the fundraiser originally hopes to receive (measured in dollars).
6. Days_of_Fundraising: The number of days for which the fundraiser has been actively soliciting donations.
7. Number_of_Donors: The number of people who donated to the fundraiser.
8. FB_Shares: The number of times the fundraiser has been shared through Facebook.
9. Text: The text description that accompanies the fundraiser, usually describing why funds are being raised.

In the Number_of_Donors column and the FB_Shares column, entries are stored as strings, and large numbers are abbreviated by using "K" to indicate thousands. For example, an entry 3K would correspond to 3 * 1000 = 3000.

Note that much of the data for each fundraiser is determined by the user who creates the campaign. When creating a campaign, the user selects an appropriate category and title, chooses the location, sets the fundraising goal, and adds the text description.

As a data scientist, data cleaning is where you will spend a significant portion of your time. This project is no different!

Let's start by examining the Number_of_Donors and FB_Shares columns. The values in these columns are strings, not integers, so we can't actually do things like compare values numerically. Moreover, some values are in thousands, such as '72.5K', and others are fewer than 1000, such as '366'. Let's get this data in a more usable format.


### Data cleaning

###     Question 1 Define a function named convert_units which converts a string representation of a  number to an integer. This function should work whether or not the string representation has a  ‘K’ in it (representing thousands).

In [15]:
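# Strip the "K" suffix and convert to float, then scale by 1000.
# Note: the scaling is applied to every row, including entries with no "K".
gfm['FB_Shares']=gfm['FB_Shares'].map(lambda st: float(st.replace("K","")))
gfm['FB_Shares']=gfm['FB_Shares']*1000

# Same conversion for the donor counts
gfm['Number_of_Donors']=gfm['Number_of_Donors'].map(lambda st: float(st.replace("K","")))
gfm['Number_of_Donors']=gfm['Number_of_Donors']*1000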

gfm.head()


Unnamed: 0,Category,Title,Location,Amount_Raised,Goal,Days_of_Fundraising,Number_of_Donors,FB_Shares,Text
0,Medical,Justice for Jacob Blake,"Kenosha, WI",2297930.0,3000000.0,93,72500.0,118000.0,on august 23rd my son was shot multiple times ...
1,Medical,Official Navajo Nation COVID-19 Relief Fund,"Window Rock, AZ",1862040.0,1000000.0,205,21900.0,71700.0,the navajo nation covid-19 fund has been estab...
2,Medical,Help a front line nurse and baby get proper care,"Randolph, NJ",954793.0,1200000.0,215,18300.0,16400.0,"on sunday, april 12, sylvia leroy, a pregnant ..."
3,Medical,"Rest up, Tommy, we'll see you soon","Scottsdale, AZ",673179.0,1000000.0,131,10300.0,21300.0,"first, thank you for being here. tommy rivers ..."
4,Medical,OFFICIAL BRANDON SAENZ MEDICAL FUND,"Tyler, TX",570529.0,750000.0,175,24300.0,5500.0,my name is melissa green and i am the mother o...
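
Question 1 asks for a reusable function rather than an inline lambda. The following is a minimal sketch of what `convert_units` could look like, assuming every entry is either a plain number such as '366' or a number followed by a single 'K' such as '72.5K'. Only entries that carry the suffix are scaled, so '366' stays 366 instead of becoming 366000.

```python
def convert_units(value: str) -> int:
    """Convert a string such as '72.5K' or '366' to an integer."""
    text = value.strip()
    if text.upper().endswith("K"):
        # 'K' marks thousands: '72.5K' -> 72.5 * 1000.
        # round() guards against float noise before the int conversion.
        return int(round(float(text[:-1]) * 1000))
    # No suffix: the string already holds the full number
    return int(float(text))


print(convert_units("72.5K"))
print(convert_units("366"))
```

Under these assumptions the function could be applied column-wise with `gfm['FB_Shares'].map(convert_units)` on the raw string column.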


### Introductory Details About Data 

In [None]:
gfm.head()

In [None]:
gfm.tail()

In [None]:
gfm.shape

In [None]:
#gfm.shape
gfm.info()

In [None]:
print(gfm.columns)

### Statistical Insights

In [None]:
gfm.describe()

### Data cleaning- Checking for null values 

checking missing values

In [None]:
gfm.isna().sum()                                    # gives the number of missing values for each variable

### Removing Null Entries 

In [None]:
gfm.dropna(axis=0 , inplace=True)                   #  If null entries are there
gfm.shape

### Filling values in place of Null Entries (If Numerical feature)


The fill value can be the column mean, the column median, or any constant.
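
As an illustration of filling rather than dropping, here is a small sketch on a toy frame. The frame `toy` and the column `Goal_filled` are made up for the example; the median is used here, but `.mean()` or a constant would be passed to `fillna` in the same way.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for gfm; the values are made up for illustration
toy = pd.DataFrame({"Goal": [1000.0, np.nan, 3000.0, 5000.0]})

# The median ignores NaN by default and is less sensitive to extreme goals
median_goal = toy["Goal"].median()

# fillna returns a new Series; the original column is left untouched
toy["Goal_filled"] = toy["Goal"].fillna(median_goal)
print(toy)
```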

In [None]:
# One representative row per category (the first campaign listed for each)
data = gfm.drop_duplicates(subset="Category")
data.head()

In [None]:
gfm["Category"].value_counts()

### Handling Outliers

In [None]:
sns.boxplot(data=gfm)

In [None]:
sns.boxplot(x = 'Amount_Raised' , data=gfm)

In [None]:
sns.boxplot(x = 'Goal' , data=gfm)

In [None]:
sns.boxplot(x = 'Days_of_Fundraising' , data=gfm)

In [None]:
#sns.boxplot(x = 'Number_of_Donors' , data=gfm)

In [None]:
sns.boxplot(x = 'FB_Shares' , data=gfm)

### Removing Outliers 

For Amount_Raised

In [None]:
# IQR
Q1 = gfm['Amount_Raised'].quantile(0.25, interpolation='midpoint')
Q3 = gfm['Amount_Raised'].quantile(0.75, interpolation='midpoint')
IQR = Q3 - Q1

print("Old Shape: ", gfm.shape)

# Upper bound
upper = Q3 + 1.5 * IQR

# Lower bound
lower = Q1 - 1.5 * IQR

# Removing the Outliers: keep only rows strictly inside the bounds.
# A boolean mask selects by row, so it stays correct even when the index
# has gaps (for example after dropna), unlike dropping np.where positions by label.
gfm = gfm[(gfm['Amount_Raised'] > lower) & (gfm['Amount_Raised'] < upper)].copy()

print("New Shape: ", gfm.shape)

sns.boxplot(x='Amount_Raised', data=gfm)

### QUESTION 2
### Overwrite the columns for Number_of_Donors and FB_Shares with columns of the same name that contain integers instead of strings. The dataframe should still be named gfm_data.

In [16]:
gfm_data = pd.DataFrame(gfm['Number_of_Donors'])
print(gfm_data)

     Number_of_Donors
0             72500.0
1             21900.0
2             18300.0
3             10300.0
4             24300.0
..                ...
833          366000.0
834          279000.0
835          318000.0
836          148000.0
837           45000.0

[838 rows x 1 columns]


In [17]:
gfm_data['FB_Shares'] = gfm['FB_Shares']
print(gfm_data)

     Number_of_Donors  FB_Shares
0             72500.0   118000.0
1             21900.0    71700.0
2             18300.0    16400.0
3             10300.0    21300.0
4             24300.0     5500.0
..                ...        ...
833          366000.0   214000.0
834          279000.0     2500.0
835          318000.0   868000.0
836          148000.0   379000.0
837           45000.0   242000.0

[838 rows x 2 columns]
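
The columns above are floats, while the question asks for integers. A possible vectorized sketch of the overwrite is shown below on a toy frame; the frame `toy` and its values are illustrative only, and the code assumes each entry is a number with at most one trailing 'K'.

```python
import pandas as pd

# Toy frame with the same two string columns; values are for illustration only
toy = pd.DataFrame({
    "Number_of_Donors": ["72.5K", "366", "21.9K"],
    "FB_Shares": ["118K", "214", "2.5K"],
})

for col in ["Number_of_Donors", "FB_Shares"]:
    has_k = toy[col].str.endswith("K")
    numbers = toy[col].str.rstrip("K").astype(float)
    # Keep values without a suffix as they are; scale the rest by 1000
    numbers = numbers.where(~has_k, numbers * 1000)
    # Overwrite the string column with an integer column of the same name
    toy[col] = numbers.round().astype(int)

print(toy.dtypes)
print(toy)
```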


###    Question 3. Create a new dataframe called gfm_campaigns that has the same columns as gfm,  plus two more: 
#### 1. Proportion_Raised, which has the proportion of the overall goal that the campaign  raised. This proportion should be rounded (not truncated) to 3 decimal places. 
#### 2. Average_Donation_Amount, which has the average donation amount per donor. Since  this is a dollar amount, it should be rounded (not truncated) to 2 decimal places. 


In [18]:
proportion_raised=gfm['Amount_Raised']/gfm['Goal']
proportion_raised

0      0.765977
1      1.862040
2      0.795661
3      0.673179
4      0.760705
         ...   
833    0.867400
834    1.135294
835    0.071450
836    0.234357
837    0.358100
Length: 838, dtype: float64

In [19]:
Average_Donation_Amount=gfm['Amount_Raised']/gfm['Number_of_Donors']
Average_Donation_Amount

0      31.695586
1      85.024658
2      52.174481
3      65.357184
4      23.478560
         ...    
833     0.047399
834     0.069176
835     0.067406
836     0.110845
837     0.397889
Length: 838, dtype: float64

In [20]:
gfm_data['FB_Shares']=gfm['FB_Shares']

In [21]:
d = {"Number_of_Donors":gfm_data['Number_of_Donors'],"FB_Shares":gfm_data['FB_Shares'],"proportion_raised":proportion_raised,"Average_Donation_Amount":Average_Donation_Amount}
d

{'Number_of_Donors': 0       72500.0
 1       21900.0
 2       18300.0
 3       10300.0
 4       24300.0
          ...   
 833    366000.0
 834    279000.0
 835    318000.0
 836    148000.0
 837     45000.0
 Name: Number_of_Donors, Length: 838, dtype: float64,
 'FB_Shares': 0      118000.0
 1       71700.0
 2       16400.0
 3       21300.0
 4        5500.0
          ...   
 833    214000.0
 834      2500.0
 835    868000.0
 836    379000.0
 837    242000.0
 Name: FB_Shares, Length: 838, dtype: float64,
 'proportion_raised': 0      0.765977
 1      1.862040
 2      0.795661
 3      0.673179
 4      0.760705
          ...   
 833    0.867400
 834    1.135294
 835    0.071450
 836    0.234357
 837    0.358100
 Length: 838, dtype: float64,
 'Average_Donation_Amount': 0      31.695586
 1      85.024658
 2      52.174481
 3      65.357184
 4      23.478560
          ...    
 833     0.047399
 834     0.069176
 835     0.067406
 836     0.110845
 837     0.397889
 Length: 838, dtype: float64}

In [22]:
gfm_campaign=pd.DataFrame(d)
gfm_campaign

Unnamed: 0,Number_of_Donors,FB_Shares,proportion_raised,Average_Donation_Amount
0,72500.0,118000.0,0.765977,31.695586
1,21900.0,71700.0,1.862040,85.024658
2,18300.0,16400.0,0.795661,52.174481
3,10300.0,21300.0,0.673179,65.357184
4,24300.0,5500.0,0.760705,23.478560
...,...,...,...,...
833,366000.0,214000.0,0.867400,0.047399
834,279000.0,2500.0,1.135294,0.069176
835,318000.0,868000.0,0.071450,0.067406
836,148000.0,379000.0,0.234357,0.110845


In [23]:
gfm_campaign.round({"proportion_raised":3,"Average_Donation_Amount":2})

Unnamed: 0,Number_of_Donors,FB_Shares,proportion_raised,Average_Donation_Amount
0,72500.0,118000.0,0.766,31.70
1,21900.0,71700.0,1.862,85.02
2,18300.0,16400.0,0.796,52.17
3,10300.0,21300.0,0.673,65.36
4,24300.0,5500.0,0.761,23.48
...,...,...,...,...
833,366000.0,214000.0,0.867,0.05
834,279000.0,2500.0,1.135,0.07
835,318000.0,868000.0,0.071,0.07
836,148000.0,379000.0,0.234,0.11
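
Note that `round` returns a new dataframe, so its result has to be assigned to be kept. One way to build the frame the question describes, keeping all original columns and adding the two rounded ones, is sketched below. The toy frame uses the first two rows shown earlier; `assign` is just one option among several.

```python
import pandas as pd

# First two campaigns from the data, used here as a small stand-in for gfm
toy = pd.DataFrame({
    "Amount_Raised": [2297930.0, 1862040.0],
    "Goal": [3000000.0, 1000000.0],
    "Number_of_Donors": [72500, 21900],
})

# assign returns a copy with the extra columns, leaving the original frame unchanged
gfm_campaigns = toy.assign(
    Proportion_Raised=(toy["Amount_Raised"] / toy["Goal"]).round(3),
    Average_Donation_Amount=(toy["Amount_Raised"] / toy["Number_of_Donors"]).round(2),
)
print(gfm_campaigns)
```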


###   Question 4. Create a new dataframe called gfm_success, which has all the information from the  gfm_campaigns dataframe plus a new column titled How_Successful that indicates how  successful a campaign was, depending on its Proportion_Raised. If x is the proportion raised,  we use the table below to define how successful a campaign is. 

![image.png](attachment:image.png)

In [None]:
#gfm_campaigns["How_successful"] = "154102.1"
#gfm_caHow_successfulmpaigns.head()
#n=len()
#n

In [24]:
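How_Successful=[]
for x in gfm_campaign['proportion_raised']:
    # Each branch only needs an upper bound, because earlier branches already
    # caught the smaller values. This keeps exact boundary values such as 0.20
    # out of the else branch. Which side each boundary belongs to is assumed here.
    if x<0.20:
        How_Successful.append("highy unsuccessful")
    elif x<0.50:
        How_Successful.append("moderately unsuccessful")
    elif x<0.80:
        How_Successful.append("moderately successful")
    elif x<1.0:
        How_Successful.append("highly successful")
    else:
        How_Successful.append("extremly successfull")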
                                    

In [25]:
e = {"Number_of_Donors":gfm_campaign['Number_of_Donors'],"FB_Shares":gfm_campaign['FB_Shares'],"proportion_raised":gfm_campaign['proportion_raised'],"Average_Donation_Amount":gfm_campaign['Average_Donation_Amount'],"How_Successful":How_Successful}
gfm_success = pd.DataFrame(e)
gfm_success

Unnamed: 0,Number_of_Donors,FB_Shares,proportion_raised,Average_Donation_Amount,How_Successful
0,72500.0,118000.0,0.765977,31.695586,moderately successful
1,21900.0,71700.0,1.862040,85.024658,extremly successfull
2,18300.0,16400.0,0.795661,52.174481,moderately successful
3,10300.0,21300.0,0.673179,65.357184,moderately successful
4,24300.0,5500.0,0.760705,23.478560,moderately successful
...,...,...,...,...,...
833,366000.0,214000.0,0.867400,0.047399,highly successful
834,279000.0,2500.0,1.135294,0.069176,extremly successfull
835,318000.0,868000.0,0.071450,0.067406,highy unsuccessful
836,148000.0,379000.0,0.234357,0.110845,moderately unsuccessful
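
The same binning can also be expressed without a loop using `pd.cut`. This is a sketch only: the cut points, the label spellings, and the choice that each range includes its lower edge are assumptions, since the defining table is only available as an image.

```python
import numpy as np
import pandas as pd

# A few sample proportions, including exact boundary values
proportions = pd.Series([0.10, 0.20, 0.766, 0.867, 1.0, 1.862])

# Assumed cut points and labels; check both against the table in the assignment
bins = [-np.inf, 0.20, 0.50, 0.80, 1.0, np.inf]
labels = [
    "highly unsuccessful",
    "moderately unsuccessful",
    "moderately successful",
    "highly successful",
    "extremely successful",
]

# right=False makes each bin include its lower edge: [0.20, 0.50), [0.50, 0.80), ...
how_successful = pd.cut(proportions, bins=bins, labels=labels, right=False)
print(how_successful.tolist())
```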


In [None]:
gfm_campaigns['How_Successful'] = round(1.156295e+05 , 2)
gfm_campaigns

In [None]:
for ind, row in gfm_campaigns.iterrows():
    gfm_campaigns.loc[ind, "How_Successful"] = len(row['Text'])

###     Question 5. Create a new dataframe called gfm, which has all the information from the  gfm_success dataframe plus a new column called Num_Chars with the length of the text  description for each campaign, as an int. 

In [29]:
gfm['Num_Chars']=gfm['Text'].str.len()
f={"Number_of_Donors":gfm_data['Number_of_Donors'],"FB_Shares":gfm_data['FB_Shares'],"proportion_raised":proportion_raised,"Average_Donation_Amount":Average_Donation_Amount,"How_Successful":How_Successful,"Text":gfm['Text'],"Num_Chars":gfm['Num_Chars']}
gfm=pd.DataFrame(f)


In [30]:
gfm_campaign.head()

Unnamed: 0,Number_of_Donors,FB_Shares,proportion_raised,Average_Donation_Amount
0,72500.0,118000.0,0.765977,31.695586
1,21900.0,71700.0,1.86204,85.024658
2,18300.0,16400.0,0.795661,52.174481
3,10300.0,21300.0,0.673179,65.357184
4,24300.0,5500.0,0.760705,23.47856


### Question 6. Overwrite the Text column in the gfm dataframe so that all the descriptions appear  in lowercase. 

In [31]:
gfm['Text']=gfm['Text'].str.lower()
print(gfm['Text'])

0      on august 23rd my son was shot multiple times ...
1      the navajo nation covid-19 fund has been estab...
2      on sunday, april 12, sylvia leroy, a pregnant ...
3      first, thank you for being here. tommy rivers ...
4      my name is melissa green and i am the mother o...
                             ...                        
833    the need for community and collective healing ...
834    i am 35 years old and was born with cerebral p...
835    greetings,my name is nyanyika banda. i am a ch...
836    my wife, roni, and i have faced many  health c...
837    no matter where life has lead us, our past for...
Name: Text, Length: 838, dtype: object


### Question 7. Set cutoff to be the smallest number of characters for which we will consider a text  description to be long, if we want the groups to be about the same size.  

In [34]:
# Vectorized comparison of each description's length against the cutoff
gfm["impactful_words"] = gfm['Text'].str.len() > 2400.142857

In [35]:
gfm.head()

Unnamed: 0,Number_of_Donors,FB_Shares,proportion_raised,Average_Donation_Amount,How_Successful,Text,Num_Chars,impactful_words
0,72500.0,118000.0,0.765977,31.695586,moderately successful,on august 23rd my son was shot multiple times ...,964,False
1,21900.0,71700.0,1.86204,85.024658,extremly successfull,the navajo nation covid-19 fund has been estab...,824,False
2,18300.0,16400.0,0.795661,52.174481,moderately successful,"on sunday, april 12, sylvia leroy, a pregnant ...",1651,False
3,10300.0,21300.0,0.673179,65.357184,moderately successful,"first, thank you for being here. tommy rivers ...",1461,False
4,24300.0,5500.0,0.760705,23.47856,moderately successful,my name is melissa green and i am the mother o...,1241,False
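
The question asks for a `cutoff` that splits the campaigns into two groups of about equal size. One reading of that requirement is the median of Num_Chars, sketched below on the five lengths shown above. Treating "long" as "at least the cutoff" is an assumption.

```python
import pandas as pd

# Num_Chars of the first five campaigns, taken from the output above
num_chars = pd.Series([964, 824, 1651, 1461, 1241])

# The median has about half of the values on either side of it
cutoff = num_chars.median()

# A description counts as long when it has at least `cutoff` characters
is_long = num_chars >= cutoff
print(cutoff, is_long.tolist())
```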


In [None]:
gfm_campaign.describe()

### Question 8. Create a new dataframe called gfm_impactful, which contains the information from gfm plus a column named Has_Impactful which is True/False depending whether the campaign contains  at least one impactful word or not. 

In [40]:
gfm_impactful = pd.DataFrame(gfm)
gfm_impactful.head()

Unnamed: 0,Number_of_Donors,FB_Shares,proportion_raised,Average_Donation_Amount,How_Successful,Text,Num_Chars,impactful_words,Has_Impactful
0,72500.0,118000.0,0.765977,31.695586,moderately successful,on august 23rd my son was shot multiple times ...,964,False,154102.1
1,21900.0,71700.0,1.86204,85.024658,extremly successfull,the navajo nation covid-19 fund has been estab...,824,False,154102.1
2,18300.0,16400.0,0.795661,52.174481,moderately successful,"on sunday, april 12, sylvia leroy, a pregnant ...",1651,False,154102.1
3,10300.0,21300.0,0.673179,65.357184,moderately successful,"first, thank you for being here. tommy rivers ...",1461,False,154102.1
4,24300.0,5500.0,0.760705,23.47856,moderately successful,my name is melissa green and i am the mother o...,1241,False,154102.1
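
A True/False Has_Impactful column could be computed along the following lines. The word list here is hypothetical, since the actual list of impactful words is not given in this notebook, and the toy descriptions are made up for the example.

```python
import re

import pandas as pd

# Hypothetical word list; substitute the list provided with the assignment
impactful_words = ["urgent", "emergency", "hospital"]

# Toy lowercase descriptions standing in for gfm['Text']
toy = pd.DataFrame({"Text": [
    "my son is in the hospital and needs surgery",
    "help us build a community garden",
]})

# One alternation with word boundaries, so 'urgent' does not match 'insurgent'
pattern = r"\b(?:" + "|".join(re.escape(w) for w in impactful_words) + r")\b"

# True when the description contains at least one of the words
gfm_impactful = toy.assign(Has_Impactful=toy["Text"].str.contains(pattern, regex=True))
print(gfm_impactful)
```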
